Provable Gradient Variance Guarantees for Black-Box Variational Inference

Justin Domke
College of Information and Computer Sciences
University of Massachusetts Amherst

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Recent variational inference methods use stochastic gradient estimators whose variance is not well understood. Theoretical guarantees for these estimators are important to understand when these methods will or will not work. This paper gives bounds for the common "reparameterization" estimators when the target is smooth and the variational family is a location-scale distribution. These bounds are unimprovable and thus provide the best possible guarantees under the stated assumptions.

1 Introduction

Take a distribution $p(z,x)$ representing relationships between data $x$ and latent variables $z$. After observing $x$, one might wish to approximate the marginal probability $p(x)$ or the posterior $p(z|x)$.
Variational inference (VI) is based on the simple observation that for any distribution $q(z)$,

$$\log p(x) = \underbrace{\mathbb{E}_{z\sim q}\,\log\frac{p(z,x)}{q(z)}}_{\mathrm{ELBO}(q)} + \mathrm{KL}(q(z)\,\|\,p(z|x)). \qquad (1)$$

VI algorithms typically choose an approximating family $q_w$ and maximize $\mathrm{ELBO}(q_w)$ over $w$. Since $\log p(x)$ is fixed, this simultaneously tightens a lower bound on $\log p(x)$ and minimizes the divergence from $q_w(z)$ to the posterior $p(z|x)$.

Traditional VI algorithms suppose $p$ and $q_w$ are simple enough for certain expectations to have closed forms, leading to deterministic coordinate-ascent type algorithms [6, 1, 20]. Recent work has turned towards stochastic optimization. There are two motivations for this. First, stochastic data subsampling can give computational savings [7]. Second, more complex distributions can be addressed if $p$ is treated as a "black box", with no expectations available [9, 15, 19]. In both cases, one can still estimate a stochastic gradient of the ELBO [17] and thus use stochastic gradient optimization. It is possible to address very complex and large-scale problems using this strategy [10].

These improvements in scale and generality come at a cost: stochastic optimization is typically less reliable than deterministic coordinate ascent. Convergence is often a challenge, and methods typically use heuristics for parameters like step-sizes. Failures do frequently occur in practice [22, 11, 4].

To help understand when black-box VI can be expected to work, this paper investigates the variance of gradient estimates. This is a major issue in practice, and many ideas have been proposed to attempt to reduce the variance [8, 5, 12, 2, 18, 13, 14, 16]. Despite all this, few rigorous guarantees on the variance of gradient estimators seem to be known (Sec. 5.1).

1.1 Contributions
This paper studies "reparameterization" (RP) or "path" based gradient estimators when $q_w$ is in a multivariate location-scale family. We decompose $\mathrm{ELBO}(q_w) = l(w) + h(w)$, where $h(w)$ is the entropy of $q_w$ (known in closed form) and $l(w) = \mathbb{E}_{z\sim q_w}\log p(z,x)$. The key assumption is that $\log p(z,x)$ is (Lipschitz) smooth as a function of $z$, meaning that $\nabla_z\log p(z,x)$ cannot change too quickly as $z$ changes. Formally, $f(z)$ is $M$-smooth if $\|\nabla f(z) - \nabla f(z')\|_2 \le M\|z - z'\|_2$.

Bound for smooth target distributions: If $g$ is the RP gradient estimator of $\nabla l(w)$ and $\log p$ is $M$-smooth, then $\mathbb{E}\|g\|^2$ is bounded by a quadratic function of $w$ (Thm. 3). With a small relaxation, this is $\mathbb{E}\|g\|^2 \le a M^2\|w - \bar w\|^2$ (Eq. (3)), where $\bar w$ are fixed parameters and $a$ is determined by the location-scale family.

Generalized bound: We extend this result to consider a more general notion of "matrix" smoothness (Thm. 5), reflecting that the sensitivity of $\nabla_z\log p(z,x)$ to changes in $z$ may depend on the direction of change.

Data subsampling: We again extend this result to consider data subsampling (Thm. 6). In particular, we observe that non-uniform subsampling gives tighter bounds.

In all cases, we show that the bounds are unimprovable. We experimentally compare these bounds to empirical variance.
2 Setup

Given some "black box" function $f$, this paper studies estimating gradients of functions $l$ of the form $l(w) = \mathbb{E}_{z\sim q_w} f(z)$. Now, suppose some base distribution $s$ and mapping $T_w$ are known such that if $u\sim s$, then $T_w(u)\sim q_w$. Then, $l$ can be written as $l(w) = \mathbb{E}_{u\sim s} f(T_w(u))$. If we define $g = \nabla_w f(T_w(u))$, then $g$ is an unbiased estimate of $\nabla l$, i.e. $\mathbb{E} g = \nabla l(w)$.

The same idea can be used when $f$ is composed as a finite sum, $f(z) = \sum_{n=1}^N f_n(z)$. If $N$ is large, even evaluating $f$ once might be expensive. However, take any positive distribution $\rho$ over $n\in\{1,\dots,N\}$ and sample $n\sim\rho$ independently of $u\sim s$. Then, if we define $g = \nabla_w\,\rho(n)^{-1} f_n(T_w(u))$, this is again an unbiased estimator with $\mathbb{E} g = \nabla l(w)$.

Convergence rates in stochastic optimization depend on the variability of the gradient estimator, typically either via the expected squared norm (ESN) $\mathbb{E}\|g\|_2^2$ or the trace of the variance $\operatorname{tr}\mathbb{V} g$. These are closely related, since $\mathbb{E}\|g\|_2^2 = \operatorname{tr}\mathbb{V} g + \|\mathbb{E} g\|_2^2$. The goal of this paper is to bound the variability of $g$ for reparameterization / path estimators. This requires making assumptions about (i) the transformation function $T_w$ and base distribution $s$ (which determine $q_w$) and (ii) the target function $f$.
Here, we are interested in the case of affine mappings. We use the mapping [17]

$$T_w(u) = C u + m,$$

where $w = (m, C)$ is a single vector of all parameters. This is the most common mapping used to represent location-scale families. That is, if $u\sim s$ then $T_w(u)$ is equal in distribution to a location-scale family distribution. For example, if $s = \mathcal{N}(0, I)$ then $T_w(u)$ is equal in distribution to $\mathcal{N}(m, CC^\top)$.

We will refer to the base distribution as standardized if the components of $u = (u_1,\dots,u_d)\sim s$ are iid with $\mathbb{E} u_1 = \mathbb{E} u_1^3 = 0$ and $\mathbb{V} u_1 = 1$. The bounds will depend on the fourth moment $\kappa = \mathbb{E} u_1^4$, but are otherwise independent of $s$.
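As a concrete illustration of the estimator just described, here is a minimal sketch in Python, assuming a Gaussian base distribution and a user-supplied gradient oracle `grad_f` (the function names and the Gaussian choice are illustrative, not fixed by the text). For the affine mapping, the chain rule gives $\nabla_m f(T_w(u)) = \nabla f(z)$ and $\nabla_C f(T_w(u)) = \nabla f(z)\,u^\top$ at $z = T_w(u)$, which is all the code needs.

```python
import numpy as np

def rp_gradient(grad_f, m, C, rng):
    """One reparameterization ("path") gradient sample for the affine
    mapping T_w(u) = C u + m with a standardized Gaussian base s = N(0, I).

    grad_f(z) returns the gradient of the black-box target f at z.
    Returns (g_m, g_C) with g_m = grad f(z) and g_C = grad f(z) u^T,
    which together form g = grad_w f(T_w(u))."""
    d = m.shape[0]
    u = rng.standard_normal(d)       # u ~ s (standardized base)
    z = C @ u + m                    # z = T_w(u), distributed as q_w
    gf = grad_f(z)                   # gradient of the target at the sampled point
    return gf, np.outer(gf, u)       # (gradient wrt m, gradient wrt C)

# Example: for f(z) = -0.5 ||z||^2 we have grad f(z) = -z, so E[g_m] = -m.
rng = np.random.default_rng(0)
m, C = np.array([1.0, -2.0]), np.eye(2)
samples = [rp_gradient(lambda z: -z, m, C, rng) for _ in range(50_000)]
print(np.mean([g_m for g_m, _ in samples], axis=0))   # approximately -m
```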
To apply these estimators to VI, choose $f(z) = \log p(z,x)$. Then $\mathrm{ELBO}(w) = l(w) + h(w)$, where $h$ is the entropy of $q_w$. Stochastic estimates of the gradient of $l$ can be employed in a stochastic gradient method to maximize the ELBO. To model the stochastic setting, suppose that $X = (x_1,\dots,x_N)$ are iid and $p(z, X) = p(z)\prod_{n=1}^N p(x_n|z)$. Then, one may choose, e.g., $f_n(z) = \frac{1}{N}\log p(z) + \log p(x_n|z)$. The entropy $h$ is related to the (constant) entropy of the base distribution as $h(w) = \mathrm{Entropy}(s) + \log|C|$.

The main bounds of this paper concern estimators for the gradient of $l(w)$ alone, disregarding $h(w)$. There are two reasons for this. First, in location-scale families, the exact gradient of $h(w)$ is known. Second, if one uses a stochastic estimator for $h(w)$, this can be "absorbed" into $l(w)$ to some degree. This is discussed further in Sec. 5.
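To make the decomposition $\mathrm{ELBO}(w) = l(w) + h(w)$ concrete, the following sketch combines the reparameterization estimate of $\nabla l$ with the exact entropy gradient. It assumes a Gaussian base and an invertible scale matrix $C$ (illustrative assumptions), and uses the standard facts $\nabla_C\log|\det C| = C^{-\top}$ and $\nabla_m h(w) = 0$.

```python
import numpy as np

def elbo_gradient(grad_log_p, m, C, rng):
    """Single-sample estimate of the gradient of ELBO(w) = l(w) + h(w):
    l(w) = E_{z~q_w} log p(z, x) is estimated by reparameterization, while
    the entropy h(w) = Entropy(s) + log|C| is differentiated exactly.

    grad_log_p(z) returns the gradient of log p(z, x) with respect to z."""
    d = m.shape[0]
    u = rng.standard_normal(d)
    z = C @ u + m
    gl = grad_log_p(z)
    grad_m = gl                                    # stochastic part only
    grad_C = np.outer(gl, u) + np.linalg.inv(C).T  # add exact entropy gradient
    return grad_m, grad_C
```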
3 Variance Bounds

3.1 Technical Lemmas

We begin with two technical lemmas which will do most of the work in the main results. Both have (somewhat laborious) proofs in Sec. 7 (Appendix). The first lemma relates the norm of the parameter gradient of $f(T_w(u))$ (with respect to $w$) to the norm of the gradient of $f(z)$ itself, evaluated at $z = T_w(u)$.

Lemma 1. For any $w$ and $u$,
$$\|\nabla_w f(T_w(u))\|_2^2 = \|\nabla f(T_w(u))\|_2^2\,\big(1 + \|u\|_2^2\big).$$

The proof is tedious but essentially amounts to calculating the derivative with respect to each component of $w$ (i.e. entries $m_i$ and $C_{ij}$), summing the square of all entries, and simplifying.

The second lemma gives a closed form for the expectation of a closely related expression that will appear in the proof of Thm. 3 as a consequence of applying Lem. 1.

Lemma 2. Let $u\sim s$ for $s$ standardized with $u\in\mathbb{R}^d$ and $\mathbb{E}_{u\sim s} u_i^4 = \kappa$. Then for any $\bar z$,
$$\mathbb{E}\,\|T_w(u) - \bar z\|_2^2\,\big(1 + \|u\|_2^2\big) = (d+1)\|m - \bar z\|_2^2 + (d+\kappa)\|C\|_F^2.$$

Again, the proof is tedious but based on simple ideas: substitute the definition of $T_w$ into the left-hand side and expand all terms. This gives terms between zeroth and fourth order (in $u$). Calculating the exact expectation of each and simplifying using the assumption that $s$ is standardized gives the result.
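Both lemmas are easy to sanity-check numerically. The sketch below (a hypothetical verification script, using a Gaussian base so that $\kappa = 3$) confirms the identity of Lemma 1 exactly for a single draw, and the expectation of Lemma 2 up to Monte Carlo error.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 3
m, C = rng.standard_normal(d), rng.standard_normal((d, d))
z_bar = rng.standard_normal(d)
grad_f = lambda z: np.sin(z)          # any black-box gradient works for Lemma 1

# Lemma 1: ||grad_w f(T_w(u))||^2 = ||grad f(z)||^2 (1 + ||u||^2), for any u.
u = rng.standard_normal(d)
z = C @ u + m
gf = grad_f(z)
lhs = np.sum(gf**2) + np.sum(np.outer(gf, u)**2)   # squared norm over the (m, C) blocks
rhs = np.sum(gf**2) * (1 + np.sum(u**2))
print(np.allclose(lhs, rhs))                        # True

# Lemma 2: E ||T_w(u) - z_bar||^2 (1 + ||u||^2) = (d+1)||m - z_bar||^2 + (d+kappa)||C||_F^2.
kappa = 3.0                                         # fourth moment of N(0, 1)
U = rng.standard_normal((200_000, d))
Z = U @ C.T + m
mc = np.mean(np.sum((Z - z_bar)**2, axis=1) * (1 + np.sum(U**2, axis=1)))
exact = (d + 1) * np.sum((m - z_bar)**2) + (d + kappa) * np.sum(C**2)
print(mc, exact)                                    # agree to Monte Carlo error
```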
3.2 Basic Variance Bound

Given these two lemmas, we give our major technical result, bounding the variability of a reparameterization-based gradient estimator. This will later be extended to consider data subsampling and a generalized notion of smoothness. Note that we do not require that $f$ be convex.

Theorem 3. Suppose $f$ is $M$-smooth, $\bar z$ is a stationary point of $f$, and $s$ is standardized with $u\in\mathbb{R}^d$ and $\mathbb{E} u_i^4 = \kappa$. Let $g = \nabla_w f(T_w(u))$ for $u\sim s$. Then,

$$\mathbb{E}\|g\|_2^2 \le M^2\Big((d+1)\|m - \bar z\|_2^2 + (d+\kappa)\|C\|_F^2\Big). \qquad (2)$$

Moreover, this result is unimprovable without further assumptions.

Proof. We expand the definition of $g$, and use the above lemmas and the smoothness of $f$.

$$\begin{aligned}
\mathbb{E}\|g\|_2^2 &= \mathbb{E}\|\nabla_w f(T_w(u))\|_2^2 && \text{(definition of $g$)}\\
&= \mathbb{E}\|\nabla f(T_w(u))\|_2^2\,(1+\|u\|_2^2) && \text{(Lem. 1)}\\
&= \mathbb{E}\|\nabla f(T_w(u)) - \nabla f(\bar z)\|_2^2\,(1+\|u\|_2^2) && (\nabla f(\bar z)=0)\\
&\le \mathbb{E}\,M^2\|T_w(u)-\bar z\|_2^2\,(1+\|u\|_2^2) && \text{($f$ is smooth)}\\
&= M^2\Big((d+1)\|m-\bar z\|_2^2 + (d+\kappa)\|C\|_F^2\Big). && \text{(Lem. 2)}
\end{aligned}$$

To see that this is unimprovable without further assumptions, observe that the only inequality is using the smoothness of $f$ to bound the norm of the difference of gradients at $T_w(u)$ and at $\bar z$. But for $f(z) = \frac{M}{2}\|z - \bar z\|_2^2$ this inequality is tight. Thus, for any $M$ and $\bar z$, there is a function $f$ satisfying the assumptions of the theorem such that Eq. (2) is an equality.
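The bound and the tightness claim can also be checked numerically. The following sketch (again with a Gaussian base, $\kappa = 3$) uses the worst-case quadratic $f(z) = \frac{M}{2}\|z-\bar z\|_2^2$ named in the proof; the Monte Carlo estimate of $\mathbb{E}\|g\|_2^2$ should essentially coincide with the right-hand side of Eq. (2).

```python
import numpy as np

rng = np.random.default_rng(1)
d, M = 4, 2.5
m, C = rng.standard_normal(d), 0.3 * rng.standard_normal((d, d))
z_bar = rng.standard_normal(d)
kappa = 3.0                                  # Gaussian base

grad_f = lambda z: M * (z - z_bar)           # f(z) = (M/2)||z - z_bar||^2 is M-smooth

# Monte Carlo estimate of E||g||^2 for g = grad_w f(T_w(u)), using Lemma 1
# to evaluate ||g||^2 as ||grad f(z)||^2 (1 + ||u||^2).
U = rng.standard_normal((200_000, d))
Z = U @ C.T + m
G2 = np.sum(grad_f(Z)**2, axis=1) * (1 + np.sum(U**2, axis=1))
esn = G2.mean()

bound = M**2 * ((d + 1) * np.sum((m - z_bar)**2) + (d + kappa) * np.sum(C**2))
print(esn, bound)    # essentially equal: Eq. (2) is tight for this f
```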
With a small amount of additional looseness, we can cast Eq. (2) into a more intuitive form. Define $\bar w = (\bar z, 0_{d\times d})$, where $0_{d\times d}$ is a $d\times d$ matrix of zeros. Then, $\|w - \bar w\|_2^2 = \|m - \bar z\|_2^2 + \|C\|_F^2$, so we can slightly relax Eq. (2) to the more user-friendly form of

$$\mathbb{E}\|g\|_2^2 \le (d+\kappa)\,M^2\|w - \bar w\|_2^2. \qquad (3)$$

The only additional looseness is bounding $d+1 \le d+\kappa$. This is justified since when $s$ is standardized, $\kappa = \mathbb{E} u_i^4$ is the kurtosis, which is at least one. Here, $\kappa$ is determined by $s$ and does not depend on the dimensionality. For example, if $s$ is Gaussian, $\kappa = 3$. Thus, Eq. (3) will typically not be much looser than Eq. (2).

Intuitively, $\bar w$ are parameters that concentrate $q$ entirely at a stationary point of $f$. It is not hard to show that $\|w - \bar w\|_2^2 = \mathbb{E}_{z\sim q_w}\|z - \bar z\|_2^2$. Thus, Eq. (3) intuitively says that $\mathbb{E}\|g\|^2$ is bounded in terms of how far the average point sampled from $q_w$ is from $\bar z$. Since $f$ need not be convex, there might be multiple stationary points. In this case, Thm. 3 holds simultaneously for all of them.
3.3 Generalized Smoothness

Since the above bound is not improvable, tightening it requires stronger assumptions. The tightness of Thm. 3 is determined by the smoothness condition that the difference of gradients at two points is bounded as $\|\nabla f(y) - \nabla f(z)\|_2 \le M\|y - z\|_2$. For some problems, $f$ may be much smoother in certain directions than others. In such cases, the smoothness constant $M$ will need to reflect the worst-case direction. To produce a tighter bound for such situations, we generalize the notion of smoothness to allow $M$ to be a symmetric matrix.

Definition 4. $f$ is $M$-matrix-smooth if $\|\nabla f(y) - \nabla f(z)\|_2 \le \|M(y - z)\|_2$ (for symmetric $M$).

We can generalize the result in Thm. 3 to functions with this matrix-smoothness condition. The proof is very similar. The main difference is that after applying the smoothness condition, the matrix $M$ needs to be "absorbed" into the parameters $w = (m, C)$ before applying Lem. 2.
Theorem 5. Suppose $f$ is $M$-matrix-smooth, $\bar z$ is a stationary point of $f$, and $s$ is standardized with $u\in\mathbb{R}^d$ and $\mathbb{E} u_i^4 = \kappa$. Let $g = \nabla_w f(T_w(u))$ for $u\sim s$. Then,

$$\mathbb{E}\|g\|_2^2 \le (d+1)\|M(m - \bar z)\|_2^2 + (d+\kappa)\|M C\|_F^2. \qquad (4)$$

Moreover, this result is unimprovable without further assumptions.

Proof. The proof closely mirrors that of Thm. 3. Here, given $w = (m, C)$, we define $v = (Mm, MC)$, to be $w$ with $M$ "absorbed" into the parameters.

$$\begin{aligned}
\mathbb{E}\|g\|_2^2 &= \mathbb{E}\|\nabla_w f(T_w(u))\|_2^2 && \text{(definition of $g$)}\\
&= \mathbb{E}\|\nabla f(T_w(u))\|_2^2\,(1+\|u\|_2^2) && \text{(Lem. 1)}\\
&= \mathbb{E}\|\nabla f(T_w(u)) - \nabla f(\bar z)\|_2^2\,(1+\|u\|_2^2) && (\nabla f(\bar z)=0)\\
&\le \mathbb{E}\,\|M(T_w(u)-\bar z)\|_2^2\,(1+\|u\|_2^2) && \text{($f$ is smooth)}\\
&= \mathbb{E}\,\|T_v(u) - M\bar z\|_2^2\,(1+\|u\|_2^2) && \text{(absorb $M$ into $v$)}\\
&= (d+1)\|Mm - M\bar z\|_2^2 + (d+\kappa)\|MC\|_F^2. && \text{(Lem. 2)}
\end{aligned}$$

To see that this is unimprovable, observe that the only inequality is the matrix-smoothness condition on $f$. But for $f(z) = \frac{1}{2}(z - \bar z)^\top M (z - \bar z)$, the difference of gradients $\|\nabla f(y) - \nabla f(z)\|_2 = \|M(y - z)\|_2$ is an equality. Thus, for any $M$ and $\bar z$, there is an $f$ satisfying the assumptions of the theorem such that the bound in Eq. (4) is an equality.

It is easy to see that this reduces to Thm. 3 in the case that $f$ is smooth in the standard sense: this corresponds to the situation where $M$ is some constant times the identity. Alternatively, one can simply observe that the two results are the same if $M$ is a scalar. Thus, going forward we will use Eq. (4) to represent the result with either type of smoothness assumption on $f$.
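To see numerically why the matrix form helps, the sketch below evaluates the right-hand side of Eq. (4) against the right-hand side of Eq. (2) instantiated with the smallest compatible scalar constant $\|M\|_2$; the matrix bound is never larger, and the gap grows when the spectrum of $M$ is uneven. (The helper names and the toy $M$ are illustrative.)

```python
import numpy as np

def matrix_bound(M, m, C, z_bar, kappa):
    """Right-hand side of Eq. (4): the matrix-smoothness variance bound."""
    d = m.shape[0]
    return (d + 1) * np.sum((M @ (m - z_bar))**2) + (d + kappa) * np.sum((M @ C)**2)

def scalar_bound(M, m, C, z_bar, kappa):
    """Right-hand side of Eq. (2), using the smallest scalar smoothness
    constant compatible with M, namely its spectral norm ||M||_2."""
    d = m.shape[0]
    L = np.linalg.norm(M, 2)
    return L**2 * ((d + 1) * np.sum((m - z_bar)**2) + (d + kappa) * np.sum(C**2))

rng = np.random.default_rng(2)
d, kappa = 5, 3.0
A = rng.standard_normal((d, d))
M = A @ np.diag([10.0, 1.0, 0.1, 0.1, 0.1]) @ A.T   # symmetric, uneven spectrum
m, C, z_bar = rng.standard_normal(d), 0.2 * np.eye(d), np.zeros(d)
print(matrix_bound(M, m, C, z_bar, kappa), scalar_bound(M, m, C, z_bar, kappa))
```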
3.4 Subsampling

Often, the function $f(z)$ takes the form of a sum over other functions $f_n(z)$, typically representing different data. Write this as

$$f(z) = \sum_{n=1}^N f_n(z).$$

To estimate the gradient of $\mathbb{E}_{u\sim s} f(T_w(u))$, one can save time by using "subsampling": that is, draw a random $n$, and then estimate the gradient of $\mathbb{E}_{u\sim s} f_n(T_w(u))$. The following result bounds this procedure. It essentially just takes a set of estimators, one corresponding to each function $f_n$, bounds their expected squared norm using the previous theorems, and then combines these.

Theorem 6. Suppose $f_n$ is $M_n$-matrix-smooth, $\bar z_n$ is a stationary point of $f_n$, and $s$ is standardized with $u\in\mathbb{R}^d$ and $\mathbb{E} u_i^4 = \kappa$. Let $g = \frac{1}{\rho(n)}\nabla_w f_n(T_w(u))$ for $u\sim s$ and $n\sim\rho$ independent. Then,

$$\mathbb{E}\|g\|_2^2 \le \sum_{n=1}^N \frac{1}{\rho(n)}\Big((d+1)\|M_n(m - \bar z_n)\|_2^2 + (d+\kappa)\|M_n C\|_F^2\Big). \qquad (5)$$

Moreover, this result is unimprovable without further assumptions.

Proof. Consider a simple lemma: suppose $a_1,\dots,a_N$ are independent random vectors and $\rho$ is any distribution over $1,\dots,N$. Let $b = a_n/\rho(n)$ for $n\sim\rho$, where $n$ is independent of $a_n$. It is easy to show that $\mathbb{E} b = \sum_{n=1}^N \mathbb{E} a_n$ and $\mathbb{E}\|b\|_2^2 = \sum_n \mathbb{E}\|a_n\|_2^2/\rho(n)$. The result follows from applying this with $a_n = \nabla_w f_n(T_w(u))$, and then bounding $\mathbb{E}\|a_n\|_2^2$ using Thm. 5.

Again, in this result the only source of looseness is the use of the smoothness bound for the component functions $f_n$. Accordingly, the result can be shown to be unimprovable: for any set of stationary points $\bar z_n$ and smoothness parameters $M_n$ we can construct functions $f_n$ (as in Thm. 5) for which the previous theorems are tight, and thus this result is also tight.

This result generalizes all the previous bounds: Thm. 5 is the special case when $N = 1$, while Thm. 3 is the special case when $N = 1$ and $f_1$ is $M_1$-smooth (for a scalar $M_1$). The case where $N > 1$ but $f_n$ is $M_n$-smooth (for scalar $M_n$) is also useful: the bound in Eq. (5) remains valid, but with a scalar $M_n$.
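The bound in Eq. (5) also suggests how to pick $\rho$. The sketch below computes the per-term constants $B_n = (d+1)\|M_n(m-\bar z_n)\|_2^2 + (d+\kappa)\|M_n C\|_F^2$ and compares the bound under uniform sampling with the choice $\rho(n)\propto\sqrt{B_n}$, which minimizes $\sum_n B_n/\rho(n)$ for fixed $B_n$ (a standard Cauchy-Schwarz argument). The toy $M_n$ and $\bar z_n$ are illustrative, not taken from the paper's experiments.

```python
import numpy as np

def per_term_constants(Ms, z_bars, m, C, kappa):
    """B_n = (d+1)||M_n (m - z_bar_n)||^2 + (d+kappa)||M_n C||_F^2,
    i.e. the bracketed terms of Eq. (5)."""
    d = m.shape[0]
    return np.array([(d + 1) * np.sum((Mn @ (m - zn))**2)
                     + (d + kappa) * np.sum((Mn @ C)**2)
                     for Mn, zn in zip(Ms, z_bars)])

def subsampling_bound(B, rho):
    """Right-hand side of Eq. (5) for a sampling distribution rho over n."""
    return np.sum(B / rho)

rng = np.random.default_rng(3)
N, d, kappa = 20, 4, 3.0
Ms = [np.diag(rng.uniform(0.1, 5.0, d)) for _ in range(N)]   # toy M_n matrices
z_bars = [rng.standard_normal(d) for _ in range(N)]
m, C = rng.standard_normal(d), 0.5 * np.eye(d)

B = per_term_constants(Ms, z_bars, m, C, kappa)
uniform = np.full(N, 1.0 / N)
tuned = np.sqrt(B) / np.sum(np.sqrt(B))   # rho(n) proportional to sqrt(B_n)
print(subsampling_bound(B, uniform), subsampling_bound(B, tuned))
```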
4 Empirical Evaluation

4.1 Model and Datasets

We consider Bayesian linear regression and logistic regression models on various datasets (Table 1). Given data $(x_1,y_1),\dots,(x_N,y_N)$, let $y$ be a vector of all $y_n$ and $X$ a matrix of all $x_n$. We assume a Gaussian prior so that $p(z, y|X) = \mathcal{N}(z|0, \sigma^2 I)\prod_{n=1}^N p(y_n|z, x_n)$. For linear regression, $p(y_n|z,x_n) = \mathcal{N}(y_n|z^\top x_n, \tilde\sigma^2)$, while for logistic regression, $p(y_n|z,x_n) = \mathrm{Sigmoid}(y_n z^\top x_n)$. For both models we use a prior of $\sigma^2 = 1$. For linear regression, we set $\tilde\sigma^2 = 4$.

To justify the use of VI, apply the decomposition in Eq. (1), substituting $p(z, y|X)$ in place of $p(z, x)$, to get that

$$\log p(y|X) = \mathbb{E}_{z\sim q}\,\log\frac{p(z, y|X)}{q(z)} + \mathrm{KL}(q(z)\,\|\,p(z|y, X)).$$

Thus, adjusting the parameters of $q$ to maximize the first term on the right tightens a lower bound on the conditional log likelihood $\log p(y|X)$ and minimizes the divergence from $q$ to the posterior. So, we again take our goal as maximizing $l(w) + h(w)$. In the batch setting, $f(z) = \log p(z, y|X)$, while with subsampling, $f_n(z) = \frac{1}{N}\log p(z) + \log p(y_n|z, x_n)$.
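For concreteness, here is a sketch of the gradients that the "black box" must supply for the logistic regression model above, in both the batch and subsampled forms (an illustrative implementation with labels $y_n\in\{-1,+1\}$; the function names are not from the paper). It uses $\frac{d}{dt}\log\mathrm{Sigmoid}(t) = \mathrm{Sigmoid}(-t)$.

```python
import numpy as np

def grad_log_joint(z, X, y, sigma2=1.0):
    """Gradient of f(z) = log p(z, y|X) for Bayesian logistic regression,
    with prior N(z | 0, sigma2 I) and labels y in {-1, +1}."""
    t = y * (X @ z)                                    # y_n * z^T x_n for each n
    return -z / sigma2 + X.T @ (y / (1.0 + np.exp(t)))  # Sigmoid(-t) = 1/(1+exp(t))

def grad_f_n(z, x_n, y_n, N, sigma2=1.0):
    """Gradient of the subsampled term f_n(z) = (1/N) log p(z) + log p(y_n|z, x_n)."""
    t = y_n * (x_n @ z)
    return -z / (N * sigma2) + y_n * x_n / (1.0 + np.exp(t))
```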
Figure 1: How loose are the bounds compared to reality? Odd rows: evolution of the ELBO during the single optimization trace used to compare all estimators. Even rows: true and bounded variance with gradients estimated in "batch" (using the full dataset in each evaluation) and "uniform" (stochastically with $\rho(n) = 1/N$). The first two rows are for linear regression models, while the rest are for logistic regression. Key observations: (i) batch estimation is lower-variance but higher cost; (ii) variance with stochastic estimation varies little over time; (iii) using matrix smoothness significantly tightens bounds and is exact for linear regression models.

Sec. 8 shows that if $0 \ge \phi''(t) \ge -c$, then $\sum_{n=1}^N \phi(a_n^\top z + b_n)$ is $M$-matrix-smooth for $M = c\sum_{i=1}^N a_i a_i^\top$. Applying this (see Footnote 1) gives that $f(z)$ and $f_n(z)$ are matrix-smooth for

$$M = \frac{1}{\sigma^2} I + c\sum_{n=1}^N x_n x_n^\top, \qquad\text{and}\qquad M_n = \frac{1}{N\sigma^2} I + c\, x_n x_n^\top,$$

where $c = 1/\tilde\sigma^2$ for linear regression, and $c = 1/4$ for logistic regression. Taking the spectral norm of these matrices gives scalar smoothness constants. With subsampling, this is $\|M_n\|_2 = \frac{1}{\sigma^2 N} + c\|x_n\|_2^2$.

Footnote 1: For linear regression, set $\phi(t) = -t^2/(2\tilde\sigma^2)$, $a_n = x_n$ and $b_n = -y_n$, and observe that $\phi'' = -1/\tilde\sigma^2$. For logistic regression, set $\phi(t) = \log\mathrm{Sigmoid}(t)$, $a_n = y_n x_n$ and $b_n = 0$, and observe that $\phi'' \ge -1/4$. Adding the prior and using the triangle inequality gives the result.
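These constants are cheap to compute from the data alone. The sketch below assembles $M$, the $M_n$, and the corresponding scalar constants for the logistic regression case ($c = 1/4$); the helper name and the random stand-in data are illustrative.

```python
import numpy as np

def smoothness_constants(X, c, sigma2=1.0):
    """Matrix smoothness constants M = (1/sigma2) I + c sum_n x_n x_n^T and
    M_n = (1/(N sigma2)) I + c x_n x_n^T, plus the scalar constants
    ||M||_2 and ||M_n||_2 = 1/(sigma2 N) + c ||x_n||^2."""
    N, d = X.shape
    M = np.eye(d) / sigma2 + c * (X.T @ X)
    M_n = [np.eye(d) / (N * sigma2) + c * np.outer(x, x) for x in X]
    M_scalar = np.linalg.norm(M, 2)
    M_n_scalar = 1.0 / (sigma2 * N) + c * np.sum(X**2, axis=1)
    return M, M_n, M_scalar, M_n_scalar

# Example on random stand-in data (not one of the datasets in Table 1):
rng = np.random.default_rng(4)
X = rng.standard_normal((100, 5))
M, M_n, M_scalar, M_n_scalar = smoothness_constants(X, c=0.25)
print(M_scalar, M_n_scalar[:3])
```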
Figure 2: Tightening variance bounds reduces true variance. A comparison of the true (vertical bars) and bounded $\mathbb{E}\|g\|^2$ values produced using five different gradient estimators. Batch does not use subsampling. Uniform uses subsampling with $\rho(n) = 1/N$, proportional uses $\rho(n) \propto M_n$, opt (scalar) numerically optimizes $\rho(n)$ to tighten Eq. (5) with a scalar $M_n$, and opt (matrix) tightens Eq. (5) with a matrix $M_n$. For each sampling strategy, we show the variance bound both with a scalar and a matrix $M_n$. Uniform sampling has true and bounded values of $\mathbb{E}\|g\|^2$ ranging between 1.5x and 10x higher than those for sampling with $\rho$ numerically optimized.

Dataset      Type   # data   # dims
boston       r      506      13
fires        r      517      12
cpusmall     r      8192     13
a1a          c      1695     124
ionosphere   c      351      35
australian   c      690      15
sonar        c      208      61
mushrooms    c      8124     113

Table 1: Regression (r) and classification (c) datasets.

4.2 Evaluation of Bounds

To enable a clear comparison of different estimators and bounds, we generate a single optimization trace of parameter vectors $w$ for each dataset. All comparisons use this same trace. These use a