Almost Horizon-Free Structure-Aware Best Policy Identification with a Generative Model

Andrea Zanette, Institute for Computational and Mathematical Engineering, Stanford University, CA
Mykel J. Kochenderfer, Department of Aeronautics and Astronautics, Stanford University, CA
Emma Brunskill, Department of Computer Science, Stanford University, CA

Abstract

This paper focuses on the problem of computing an $\epsilon$-optimal policy in a discounted Markov Decision Process (MDP) provided that we can access the reward and transition function through a generative model. We propose an algorithm that is initially agnostic to the MDP but that can leverage the specific MDP structure, expressed in terms of variances of the rewards and next-state value function, and gaps in the optimal action-value function, to reduce the sample complexity needed to find a good policy, precisely highlighting the contribution of each state-action pair to the final sample complexity. A key feature of our analysis is that it removes all horizon dependencies in the sample complexity of suboptimal actions except for the intrinsic scaling of the value function and a constant additive term.
1 Introduction

A key goal is to design reinforcement learning (RL) agents that can leverage problem structure to efficiently learn a good policy, especially in problems with very long time horizons. Ideally the RL algorithm should be able to adjust without a priori information about the problem structure. Formal analyses that characterize the performance of such algorithms, yielding instance-dependent bounds, help to advance our core understanding of the characteristics that govern the hardness of learning to make good decisions under uncertainty.

Though there is relatively limited work in reinforcement learning, strong problem-dependent guarantees are available for multi-armed bandits. In particular, well known bounds for online learning scale as a function of the gap between the expected reward of a particular action and the optimal action [ABF02], and also of the variance of the rewards [AMS09]. In the pure exploration setting in bandits, which is related to the setting we consider in this paper, there exist multiple algorithms with problem-dependent bounds of this form [EMM06; MM94; MSA08; Jam+14; BMS09; ABM10; GGL12; KKS13].

Ideally the complexity of learning to make good decisions in reinforcement learning tasks would scale with the previously identified quantities of gap and variance over the next value function. As a step towards this, in this paper we introduce an algorithm for an RL agent operating in a discrete state and action space that has access to a generative model and can leverage problem-dependent structure to obtain strong instance-dependent PAC sample complexity bounds that are a function of the variance of the rewards and next-state value functions, as well as the gaps between the optimal and suboptimal state-action values. While the sequential setting brings additional difficulties due to the possibly long horizon, our bounds also show that in the dominant terms, our approach avoids suffering any horizon dependence for suboptimal actions beyond the scaling of the value function. This significantly improves in statistical efficiency over prior worst-case bounds for the generative model case [GMK13; Sid+18] and matches existing worst-case bounds in worst-case settings.

To do so we introduce a novel algorithm structure that acquires samples of state-action pairs in iterative rounds.
A slight variant of the well known simulation lemma (see e.g. [KMN02]) suggests that in order to improve our estimate of the optimal value function and policy, it is sufficient to ensure that after each round of sampling, the confidence intervals shrink over the MDP parameter estimates of both the state-action pairs visited by the optimal policy and the state-action pairs visited by the empirically-optimal policy. While of course both are unknown, we show that we can implicitly maintain a set of candidate policies that are $\epsilon$-accurate, and by ensuring that we shrink the confidence sets of all state-action pairs likely to be visited by any such policy, we are also guaranteed (with high probability) to shrink the confidence intervals of the optimal policy. Interestingly we can show that by focusing on such state-action pairs, we can avoid the horizon dependence on suboptimal actions. The key idea is to take into account the learned MDP dynamics to enforce a constraint on the suboptimality of the candidate policies. The sampling strategy is derived by solving a minimax problem that minimizes the number of samples needed to guarantee that every policy in the set of candidate policies is accurately estimated. Importantly, this minimax problem can be reformulated as a convex minimization problem that can be solved with any standard solver for convex optimization.

Our algorithmic approach is quite different from many prior approaches, both in the generative model setting and the online setting. When a generative model is available, the available worst-case optimal algorithms [AMK12; Sid+18] allocate samples uniformly to all state and action pairs. We show our approach can be substantially more effective for the general case of MDPs with heterogeneous structure, and even for pathologically hard instances, because of the reduced horizon dependence on suboptimal actions. Note too that our approach is quite different from online RL algorithms that often (implicitly) allocate exploration budget to state-action pairs encountered by the policy with the most optimistic upper bound [JOA10; AOM17; OVR13; DB15; DLB17; SLL09; LH14], since here we explicitly reason about the reduction in the confidence intervals across a large set of policies whose value is near the empirical optimal value at this round.
2 Notation and Preliminaries

We consider discounted infinite horizon MDPs [SB18], which are defined by a tuple $M = \langle \mathcal{S}, \mathcal{A}, p, r, \gamma \rangle$, where $\mathcal{S}$ and $\mathcal{A}$ are the state and action spaces with cardinality $S$ and $A$, respectively. We denote by $p(s' \mid s,a)$ the probability of transitioning to state $s'$ after taking action $a$ in state $s$, while $r(s,a) \in [0,1]$ is the average instantaneous reward collected and $R(s,a) \in [0,1]$ the corresponding random variable. The vector value function of policy $\pi$ is denoted with $V^{\pi}$. If $\rho$ is the initial starting distribution then $V^{\pi}(\rho) = \sum_s \rho_s V^{\pi}(s)$. The value function of the optimal policy $\pi^{\star}$ is denoted with $V^{\star} = V^{\pi^{\star}}$. We call $\operatorname{Var} R(s,a)$ and $\operatorname{Var}_{p(s,a)} V^{\star}$ the variance of $R(s,a)$ and of $V^{\star}(s')$ where $s' \sim p(s,a)$.

The agent interacts with the MDP via a generative model that takes as input an $(s,a)$ pair and returns a random sample of the reward $R(s,a)$ and a random next state $s^{+}$ according to the transition model, $s^{+} \sim p(s,a)$. The reinforcement learning agent maintains an empirical MDP $\hat M_k = \langle \mathcal{S}, \mathcal{A}, \hat p_k, \hat r_k, \gamma \rangle$ for every iteration/episode $k$, where the maximum likelihood transitions $\hat p_k(s,a)$ and rewards $\hat r_k(s,a)$ have received $n^k_{sa}$ samples. $\hat V^{\star}_k$ is the empirical estimate, computed on the MDP $\hat M_k$, of the value of the empirical optimal policy $\hat\pi^{\star}_k$. Variables with a hat refer to the empirical MDP $\hat M_k$ and the subscript indicates which iteration/episode they refer to. We denote with $w^{\pi,\rho}_{sa} = \sum_{t=0}^{\infty} \gamma^t \Pr(s,a,t,\pi)$ the discounted sum of visit probabilities $\Pr(s,a,t,\pi)$ to the $(s,a)$ pair at timestep $t$ if the starting state is drawn according to $\rho$, and $\hat w^{\pi,k,\rho}_{sa}$ is its analogue on $\hat M_k$. We use the $\tilde O(\cdot)$ notation to indicate a quantity that depends on $(\cdot)$ up to a polylog expression of a quantity at most polynomial in $S$, $A$, $\frac{1}{1-\gamma}$, $\frac{1}{\delta}$, where $\delta$ is the "failure probability".
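To make the notation concrete, the following minimal sketch (not from the paper; an illustration in Python with numpy, where `P_hat`, `r_hat`, and `gamma` stand for $\hat p_k$, $\hat r_k$, and $\gamma$) computes the empirical optimal value function $\hat V^{\star}_k$ and a greedy empirically optimal policy $\hat\pi^{\star}_k$ on the empirical MDP $\hat M_k$ by value iteration.

```python
import numpy as np

def empirical_optimal_values(P_hat, r_hat, gamma, tol=1e-8):
    """Value iteration on the empirical MDP M_hat_k.

    P_hat : (S, A, S) array of maximum-likelihood transitions p_hat_k(s' | s, a)
    r_hat : (S, A) array of empirical mean rewards r_hat_k(s, a)
    Returns (V_hat_star, pi_hat_star), the empirical optimal value function
    and a greedy (empirically optimal) policy.
    """
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    while True:
        # Q_hat(s, a) = r_hat(s, a) + gamma * sum_s' p_hat(s' | s, a) * V(s')
        Q = r_hat + gamma * (P_hat @ V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            break
        V = V_new
    pi_hat_star = Q.argmax(axis=1)
    return V_new, pi_hat_star
```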
Before proceeding, we first recall the following lemma from [GMK13]:

Lemma 2 (Simulation Lemma for Optimal Value Function Estimate [GMK13]). With probability at least $1-\delta$, outside the failure event, for any starting distribution $\rho$ it holds that:
$$(V^{\star} - \hat V^{\star}_k)(\rho) \le \sum_{(s,a)} \hat w^{\pi^{\star},k,\rho}_{sa} \Big[ (r - \hat r_k)(s,a) + \gamma (p - \hat p_k)(s,a)^{\top} V^{\star} \Big] \le \sum_{(s,a)} \hat w^{\pi^{\star},k,\rho}_{sa}\, CI_{sa}(n^{k}_{sa})$$
$$(V^{\star} - \hat V^{\star}_k)(\rho) \ge \sum_{(s,a)} \hat w^{\hat\pi^{\star}_k,k,\rho}_{sa} \Big[ (r - \hat r_k)(s,a) + \gamma (p - \hat p_k)(s,a)^{\top} V^{\star} \Big] \ge - \sum_{(s,a)} \hat w^{\hat\pi^{\star}_k,k,\rho}_{sa}\, CI_{sa}(n^{k}_{sa})$$

The $CI_{sa}(n^{k}_{sa})$ are Bernstein confidence intervals (defined in more detail in Appendix A) after $n^{k}_{sa}$ samples over the rewards and transitions, and are a function of the unknown reward and transition variances. The proof (see appendix) is a slight variation of Lemma 3 in [GMK13].
3 Sampling Strategy Given an Empirical MDP

We first describe how our approach allocates samples to state-action pairs given a current empirical MDP, before presenting our full algorithm in the next section. The simulation lemma (Lemma 2) suggests that to estimate the optimal value function it suffices to accurately estimate the $(s,a)$ pairs in the trajectories identified by two policies, namely the optimal policy $\pi^{\star}$ (optimal on $M$) and the empirical optimal policy $\hat\pi^{\star}_k$ (optimal on $\hat M_k$). Since $\pi^{\star}$ and $\hat\pi^{\star}_k$ are unknown (in particular, $\hat\pi^{\star}_k$ is a random variable prior to sampling), a common strategy is to allocate an identical number of samples uniformly [GMK13; Sid+18] so that the confidence intervals $CI_{sa}(n^{k}_{sa})$ are sufficiently small for all state-action pairs, leading to a small $|(V^{\star} - \hat V^{\star}_k)(\rho)|$; from here it is possible to show that the empirical optimal policy $\hat\pi^{\star}_k$ can be extracted and that $|(V^{\star} - V^{\hat\pi^{\star}_k})(\rho)|$ is also small (so $\hat\pi^{\star}_k$ is near-optimal). Therefore, in the main text we mostly focus on showing that $|(V^{\star} - \hat V^{\star}_k)(\rho)|$ is small, and defer additional details to the appendix.

The proposed approach proceeds in iterations or episodes. In each episode our algorithm implicitly maintains a set of candidate policies, which are near-optimal, and allocates more samples to the $(s,a)$ pairs visited by these policies to refine their estimated value. On the next episode those policies that are too suboptimal relative to their estimation accuracy are implicitly discarded. In particular, the samples are placed in a way that is related to the visit probabilities of the near-optimal empirical policies, in addition to the variances of the rewards and transitions of state-action pairs encountered in potentially good policies.
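As an informal illustration of the candidate set (my own sketch, not the paper's code), a policy $\pi$ can be checked against the near-optimality condition $(\hat V^{\star}_k - \hat V^{\pi}_k)(\rho) \le C\epsilon_k$ by evaluating $\hat V^{\pi}_k$ on the empirical MDP; since $\hat V^{\star}_k \ge \hat V^{\pi}_k$ componentwise, requiring the condition for every starting distribution $\rho$ amounts to a sup-norm check. The threshold constant `C` is taken as given here.

```python
import numpy as np

def in_candidate_set(P_hat, r_hat, gamma, V_hat_star, pi, C, eps_k):
    """Check the near-optimality condition on the empirical MDP M_hat_k.

    pi : (S,) array of deterministic actions.
    Returns True if max_s (V_hat_star - V_hat_pi)(s) <= C * eps_k, which
    implies (V_hat_star - V_hat_pi)(rho) <= C * eps_k for every rho.
    """
    S = r_hat.shape[0]
    # Policy-induced transition matrix and reward vector on M_hat_k.
    P_pi = P_hat[np.arange(S), pi, :]          # (S, S)
    r_pi = r_hat[np.arange(S), pi]             # (S,)
    # Policy evaluation: (I - gamma * P_pi) V_pi = r_pi.
    V_hat_pi = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return np.max(V_hat_star - V_hat_pi) <= C * eps_k
```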
3.1 Oracle Minimax Program

Suppose we have already allocated some samples and have computed the maximum likelihood MDP $\hat M_k$ with empirical optimal policy $\hat\pi^{\star}_k$, and know that the optimal value function estimate is at least $\epsilon_k$-accurate, i.e., $\|V^{\star} - \hat V^{\star}_k\|_{\infty} \le \epsilon_k$. How should we allocate further sampling resources to improve the accuracy of the optimal value function estimate? The idea is given by the simulation lemma (Lemma 2): in order to see an improvement after sampling (i.e., in the next episode $k+1$) the maximum likelihood MDP $\hat M_{k+1}$ must have smaller confidence intervals $CI_{sa}(n^{k+1}_{sa})$ in the $(s,a)$ pairs visited by $\pi^{\star}$ and by the empirical optimal policy $\hat\pi^{\star}_{k+1}$ on $\hat M_{k+1}$. Both are of course unknown. However, we introduce the constraint $(\hat V^{\star}_k - \hat V^{\pi}_k)(\rho) \le C\epsilon_k$ that restricts sampling to $C\epsilon_k$-optimal policies (and starting distributions) on $\hat M_k$. Here, $C$ is a numerical constant that will ensure that $\pi^{\star}$ and $\hat\pi^{\star}_{k+1}$ satisfy this condition and are therefore allocated enough samples. Given $C$ and $\epsilon_k$, the idea is that we should choose a sampling strategy $\{n_{sa}\}_{sa}$ with enough samples to ensure $\sum_{(s,a)} \hat w^{\pi,k+1,\rho}_{sa}\, CI_{sa}(n^{k+1}_{sa}) \le \epsilon_{k+1}$ for all policies (and starting distributions) that satisfy $(\hat V^{\star}_k - \hat V^{\pi}_k)(\rho) \le C\epsilon_k$, so that Lemma 2 ensures $|(V^{\star} - \hat V^{\star}_{k+1})(\rho)| \le \epsilon_{k+1} = \epsilon_k/2$. This is equivalent to solving the following¹:

Definition 1 (Oracle Minimax Problem).
$$\min_{n} \max_{\pi,\rho} \sum_{(s,a)} \hat w^{\pi,k+1,\rho}_{sa}\, CI_{sa}(n^{k+1}_{sa}), \quad \text{s.t.} \quad (\hat V^{\star}_k - \hat V^{\pi}_k)(\rho) \le C\epsilon_k, \quad \sum_{(s,a)} n_{sa} \le n_{\max}. \tag{1}$$

¹For space, we omit the constraints $\rho_s \ge 0$ and $\|\rho\|_1 = 1$ on the starting distribution.

Here the vector of discounted visit probabilities $\hat w^{\pi,k+1,\rho}$ is computable from the linear system $(I - \gamma (\hat P^{\pi}_{k+1})^{\top})\, \hat w^{\pi,k+1,\rho} = \rho$, and $n_{\max}$ is a guess of the number of samples needed to ensure that the objective function is $\le \epsilon_k/2$. We call this problem the oracle minimax problem because it uses the next-episode empirical visit probabilities $\hat w^{\pi,k+1,\rho}$, which are not known. In addition, it uses the true variance of the next-state value function (embedded in the definition of the confidence intervals $CI_{sa}(n^{k}_{sa})$). As these quantities are unknown in episode $k$, the program cannot be solved.
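The linear system above is easy to illustrate; the following minimal numpy sketch (mine, with hypothetical argument names) computes the discounted visit probabilities of a deterministic policy on an empirical MDP by solving $(I - \gamma (\hat P^{\pi})^{\top}) w = \rho$ for the state visit vector and distributing the result over state-action pairs.

```python
import numpy as np

def discounted_visit_probabilities(P_hat, pi, rho, gamma):
    """Discounted sum of visit probabilities w^{pi,rho}_{sa} on an empirical MDP.

    P_hat : (S, A, S) transition tensor, pi : (S,) deterministic policy,
    rho   : (S,) starting distribution.
    Solves (I - gamma * P_pi^T) w_s = rho for the state visit vector, then
    spreads it over (s, a) pairs according to the deterministic policy.
    """
    S, A, _ = P_hat.shape
    P_pi = P_hat[np.arange(S), pi, :]                        # (S, S)
    w_s = np.linalg.solve(np.eye(S) - gamma * P_pi.T, rho)   # (S,)
    w_sa = np.zeros((S, A))
    w_sa[np.arange(S), pi] = w_s
    return w_sa
```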
3.2 Algorithm Minimax Program

This section shows how to construct a minimax program that is close enough to the oracle minimax problem (Equation 1) but is a function of only empirical quantities computable from $\hat M_k$. The idea is 1) to avoid using the next-episode empirical distribution $\hat w^{\pi,k+1,\rho}$ and instead use the currently-computable $\hat w^{\pi,k,\rho}$, and 2) to use the empirical variance of the next-state value function $\operatorname{Var}_{\hat p_k(s,a)} \hat V^{\star}_k$ instead of the real, unknown variance $\operatorname{Var}_{p(s,a)} V^{\star}$. Estimating the visit distribution $\hat w^{\pi,k+1,\rho}$ accurately leads to a high sample complexity; fortunately we are able to claim that the product between the visit distribution shift $\hat w^{\pi,k+1,\rho} - \hat w^{\pi,k,\rho}$ and the confidence interval vector $CI^{k+1}$ on the rewards and transitions is already small if the policy $\pi$ has received enough samples along its trajectories before the current episode. Let us rewrite the objective function of Equation 1 as a function of the visit distribution on $\hat M_k$ plus a term that takes into account the shift in the distribution from $\hat M_k$ to $\hat M_{k+1}$:
$$\sum_{(s,a)} \hat w^{\pi,k+1,\rho}_{sa}\, CI_{sa}(n^{k+1}_{sa}) = \underbrace{\sum_{(s,a)} \hat w^{\pi,k,\rho}_{sa}}_{\text{Computable}} CI_{sa}(n^{k+1}_{sa}) + \underbrace{\sum_{(s,a)} \big(\hat w^{\pi,k+1,\rho}_{sa} - \hat w^{\pi,k,\rho}_{sa}\big)}_{\text{Shift of Empirical Distributions}} CI_{sa}(n^{k+1}_{sa})$$

Lemma 9 in the appendix allows us to claim that the rightmost summation above is less than $2C_p(n_{\min})\epsilon_k$ for both² $\pi = \pi^{\star}$ and $\pi = \hat\pi^{\star}_{k+1}$. Here $C_p(n_{\min})$ is defined in Appendix A and can be made small³ (see Lemma 16 for an example). This leads to the following program:

Definition 2 (Algorithm Minimax Problem).
$$\min_{n} \max_{\pi,\rho} \sum_{(s,a)} \hat w^{\pi,k,\rho}_{sa}\, \widehat{CI}_{sa}(n^{k+1}_{sa}), \quad \text{s.t.} \quad (\hat V^{\star}_k - \hat V^{\pi}_k)(\rho) \le C\epsilon_k, \quad \sum_{(s,a)} n_{sa} \le n_{\max}. \tag{2}$$

Here $\widehat{CI}_{sa}(n^{k+1}_{sa})$ are the confidence intervals evaluated with the empirical variances defined in Appendix A. This program is fully expressed in terms of empirical quantities that depend on $\hat M_k$. As long as a solution to the above minimax program is $\le \epsilon_k/2$, the oracle objective function will also be $\le \epsilon_k/2$ at the solution of the program (for more details see Lemma 6 in the appendix). In other words, by solving the minimax program (Definition 2) we place enough samples to satisfy the oracle program (1), which ensures accuracy in the value function estimate through Lemma 2.

²Lemma 9 bounds this term as $2C_p(n_{\min})\epsilon^{\pi}_k$ for $\pi = \pi^{\star}$ and $\pi = \hat\pi^{\star}_{k+1}$, respectively; $\epsilon^{\pi}_k$ is defined in Appendix A and represents the "accuracy" of policy $\pi$ in episode $k$. To ensure $\epsilon^{\pi}_k \le \epsilon_k$ we need an inductive argument, which is sketched out in the main theorem (Theorem 1).
³As we will shortly see, this will contribute only a constant term to the final sample complexity.
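The following sketch (my own simplification, not the paper's Definition 5 or the Appendix G reformulation) illustrates the inner maximization of the program for a fixed candidate allocation $n$: with the confidence widths $\widehat{CI}_{sa}(n_{sa})$ held fixed, maximizing over policies and starting distributions can be written as a linear program over the discounted visit measure $w$ and the starting distribution $\rho$, using the Bellman flow constraint and expressing the near-optimality constraint through $\hat V^{\pi}_k(\rho) = \sum_{(s,a)} w_{sa}\, \hat r_k(s,a)$. The full algorithm additionally minimizes over $n$, which is handled by the convex reformulation in the paper's Appendix G and is not reproduced here.

```python
import cvxpy as cp
import numpy as np

def inner_max_value(P_hat, r_hat, V_hat_star, ci_hat, gamma, C, eps_k):
    """Inner maximization of the algorithm minimax program for a fixed allocation.

    P_hat  : (S, A, S) empirical transitions; r_hat, ci_hat : (S, A) arrays.
    ci_hat : empirical confidence widths already evaluated at the candidate n.
    Maximizes sum_{s,a} w_sa * ci_hat_sa over discounted visit measures w induced
    by some (policy, starting distribution) pair that is C*eps_k-optimal on M_hat_k.
    """
    S, A, _ = P_hat.shape
    w = cp.Variable(S * A, nonneg=True)   # flattened visit measure, index s*A + a
    rho = cp.Variable(S, nonneg=True)     # starting distribution

    P_flat = P_hat.reshape(S * A, S)                   # row s*A + a -> p(. | s, a)
    agg = np.kron(np.eye(S), np.ones((1, A)))          # sums w over actions per state
    r_flat = r_hat.reshape(S * A)
    ci_flat = ci_hat.reshape(S * A)

    constraints = [
        # Bellman flow: sum_a w(s,a) = rho(s) + gamma * sum_{s',a'} p_hat(s|s',a') w(s',a')
        agg @ w == rho + gamma * (P_flat.T @ w),
        cp.sum(rho) == 1,
        # Near-optimality on M_hat_k: V_hat_star(rho) - V_hat_pi(rho) <= C * eps_k
        V_hat_star @ rho - r_flat @ w <= C * eps_k,
    ]
    problem = cp.Problem(cp.Maximize(ci_flat @ w), constraints)
    problem.solve()
    return problem.value
```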
4 Algorithm

Algorithm 1 BESPOKE
  Input: failure probability $\delta > 0$, accuracy $\epsilon_{\text{Input}} > 0$
  Set $\epsilon_1 = \frac{1}{1-\gamma}$ and allocate $n_{\min}$ samples to each $(s,a)$ pair
  for $k = 1, 2, \dots$
      for $n_{\max} = 2^0, 2^1, 2^2, \dots$
          Solve the optimization program of Definition 7 (appendix)
          if the optimal value of the program of Definition 7 is $\le \epsilon_k/2$
              Break and return the sampling strategy $\{n^{k+1}_{sa}\}_{sa}$
      Query the generative model up to $n^{k+1}_{sa}$ samples, $\forall (s,a)$
      Compute $\hat\pi^{\star}_{k+1}$ and $\hat V^{\star}_{k+1}$
      Set $\epsilon_{k+1} = \epsilon_k/2$
      if $\epsilon_{k+1} \le \epsilon_{\text{Input}}$
          Break and return the policy $\hat\pi^{\star}_{k+1}$
We now take the sampling approach described in the previous section and use it to construct an iterative algorithm for quickly learning a near-optimal or optimal policy given access to a generative model. Specifically, we present BESt POlicy identification with no Knowledge of the Environment (BESPOKE) in Algorithm 1. The algorithm proceeds in episodes. Each episode starts with an empirical MDP $\hat M_k$ whose optimal value function $\hat V^{\star}_k$ is $\epsilon_k$-accurate, $\|V^{\star} - \hat V^{\star}_k\|_{\infty} \le \epsilon_k$, under an inductive assumption. The samples are allocated at each episode $k$ by solving an optimization program equivalent to that in Definition 2 to halve the accuracy of the value function estimate, i.e., $\|V^{\star} - \hat V^{\star}_{k+1}\|_{\infty} \le \epsilon_{k+1} = \epsilon_k/2$. In the innermost loop of the algorithm, the required number of samples for the next episode is guessed as $n_{\max} = 1, 2, 4, 8, \dots$, until $n_{\max}$ is sufficient to ensure that the objective function of the minimax problem of Definition 2 will be $\le \epsilon_k/2$; the purpose of the inner loop is to avoid collecting more samples than needed, and allows us to obtain the sample complexity result of Theorem 2. In Appendix G we reformulate the optimization program 2 (described more precisely in Definition 5 in the appendix), obtaining a convex minimization program that avoids optimizing over the policy and instead works directly with the distribution $\hat w^{\pi,k,\rho}$; this can be efficiently solved with standard techniques from convex optimization [BV04].
Theorem 1 (BESPOKE Works as Intended). With probability at least $1-\delta$, in every episode $k$ BESPOKE maintains an empirical MDP $\hat M_k$ such that its optimal value function $\hat V^{\star}_k$ and its optimal policy $\hat\pi^{\star}_k$ satisfy:
$$\|V^{\star} - \hat V^{\star}_k\|_{\infty} \le \epsilon_k, \qquad \|V^{\star} - V^{\hat\pi^{\star}_k}\|_{\infty} \le 2\epsilon_k$$
where $\epsilon_{k+1} \stackrel{def}{=} \frac{\epsilon_k}{2}$, $\forall k$. In particular, when BESPOKE terminates in episode $k_{Final}$ it holds that $\frac{\epsilon_{\text{Input}}}{2} \le \epsilon_{k_{Final}} \le \epsilon_{\text{Input}}$.

The proof is reported in the appendix, and shows by induction that for every episode $k$, $\pi^{\star}$ and $\hat\pi^{\star}_{k+1}$ are in the set of candidate policies because they are near-optimal on $\hat M_k$, satisfying $(\hat V^{\star}_k - \hat V^{\pi^{\star}}_k)(\rho) \le C\epsilon_k$ and $(\hat V^{\star}_k - \hat V^{\hat\pi^{\star}_{k+1}}_k)(\rho) \le C\epsilon_k$ for all $\rho$, and are therefore allocated enough samples; this guarantees accuracy in $\hat V^{\star}_{k+1}$ by Lemma 2.
5 Sample Complexity Analysis

To analyze the sample complexity of BESPOKE we derive another optimization program that is a function of only problem-dependent quantities. We 1) shift from the empirical visit distribution $\hat w^{\pi,k,\rho}$ on $\hat M_k$ to the "real" visit distribution $w^{\pi,\rho}$ on $M$; 2) move from empirical confidence intervals to those evaluated with the true variances; and finally 3) relax the near-optimality constraint on the policies by using Lemma 7 in the appendix (for an appropriate numerical constant $C' \ge C$) in order to be able to use the value functions on $M$:
$$(\hat V^{\star}_k - \hat V^{\pi}_k)(\rho) \le C\epsilon_k \;\Longrightarrow\; (V^{\star} - V^{\pi})(\rho) \le C'\epsilon_k, \quad \forall \rho. \tag{3}$$