Provable Gradient Variance Guarantees for Black-Box Variational Inference

Justin Domke
College of Information and Computer Sciences
University of Massachusetts Amherst

Abstract

Recent variational inference methods use stochastic gradient estimators whose variance is not well understood. Theoretical guarantees for these estimators are important to understand when these methods will or will not work. This paper gives bounds for the common "reparameterization" estimators when the target is smooth and the variational family is a location-scale distribution. These bounds are unimprovable and thus provide the best possible guarantees under the stated assumptions.

1 Introduction

Take a distribution $p(z, x)$ representing relationships between data $x$ and latent variables $z$. After observing $x$, one might wish to approximate the marginal probability $p(x)$ or the posterior $p(z|x)$. Variational inference (VI) is based on the simple observation that for any distribution $q(z)$,

$$\log p(x) = \underbrace{\mathbb{E}_{z \sim q} \log \frac{p(z, x)}{q(z)}}_{\mathrm{ELBO}(q)} + \mathrm{KL}(q(z) \,\|\, p(z|x)). \quad (1)$$

VI algorithms typically choose an approximating family $q_w$ and maximize $\mathrm{ELBO}(q_w)$ over $w$. Since $\log p(x)$ is fixed, this simultaneously tightens a lower bound on $\log p(x)$ and minimizes the divergence from $q_w(z)$ to the posterior $p(z|x)$.

Traditional VI algorithms suppose $p$ and $q_w$ are simple enough for certain expectations to have closed forms, leading to deterministic coordinate-ascent type algorithms [6, 1, 20]. Recent work has turned towards stochastic optimization. There are two motivations for this. First, stochastic data subsampling can give computational savings [7]. Second, more complex distributions can be addressed if $p$ is treated as a "black box", with no expectations available [9, 15, 19]. In both cases, one can still estimate a stochastic gradient of the ELBO [17] and thus use stochastic gradient optimization. It is possible to address very complex and large-scale problems using this strategy [10].

These improvements in scale and generality come at a cost: stochastic optimization is typically less reliable than deterministic coordinate ascent. Convergence is often a challenge, and methods typically use heuristics for parameters like step sizes. Failures do frequently occur in practice [22, 11, 4].

To help understand when black-box VI can be expected to work, this paper investigates the variance of gradient estimates. This is a major issue in practice, and many ideas have been proposed to attempt to reduce the variance [8, 5, 12, 2, 18, 13, 14, 16]. Despite all this, few rigorous guarantees on the variance of gradient estimators seem to be known (Sec. 5.1).

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

1.1 Contributions

This paper studies "reparameterization" (RP) or "path" based gradient estimators when $q_w$ is in a multivariate location-scale family. We decompose $\mathrm{ELBO}(q_w) = l(w) + h(w)$, where $h(w)$ is the entropy of $q_w$ (known in closed form) and $l(w) = \mathbb{E}_{z \sim q_w} \log p(z, x)$. The key assumption is that $\log p(z, x)$ is (Lipschitz) smooth as a function of $z$, meaning that $\nabla_z \log p(z, x)$ cannot change too quickly as $z$ changes. Formally, $f(z)$ is $M$-smooth if $\|\nabla f(z) - \nabla f(z')\|_2 \le M \|z - z'\|_2$.

Bound for smooth target distributions: If $g$ is the RP gradient estimator of $\nabla l(w)$ and $\log p$ is $M$-smooth, then $\mathbb{E}\|g\|^2$ is bounded by a quadratic function of $w$ (Thm. 3). With a small relaxation, this is $\mathbb{E}\|g\|^2 \le a M^2 \|w - \bar{w}\|^2$ (Eq. (3)), where $\bar{w}$ are fixed parameters and $a$ is a constant determined by the location-scale family.

Generalized bound: We extend this result to consider a more general notion of "matrix" smoothness (Thm. 5), reflecting that the sensitivity of $\nabla_z \log p(z, x)$ to changes in $z$ may depend on the direction of change.

Data Subsampling: We again extend this result to consider data subsampling (Thm. 6). In particular, we observe that non-uniform subsampling gives tighter bounds.

In all cases, we show that the bounds are unimprovable. We experimentally compare these bounds to empirical variance.

2 Setup

Given some "black box" function $f$, this paper studies estimating gradients of functions $l$ of the form $l(w) = \mathbb{E}_{z \sim q_w} f(z)$. Now, suppose some base distribution $s$ and mapping $T_w$ are known such that if $u \sim s$, then $T_w(u) \sim q_w$. Then $l$ can be written as $l(w) = \mathbb{E}_{u \sim s} f(T_w(u))$. If we define $g = \nabla_w f(T_w(u))$, then $g$ is an unbiased estimate of $\nabla l$, i.e. $\mathbb{E} g = \nabla l(w)$. The same idea can be used when $f$ is composed as a finite sum, $f(z) = \sum_{n=1}^N f_n(z)$. If $N$ is large, even evaluating $f$ once might be expensive. However, take any positive distribution $\pi$ over $n \in \{1, \dots, N\}$ and sample $n \sim \pi$ independently of $u$. Then, if we define $g = \nabla_w \, \pi(n)^{-1} f_n(T_w(u))$, this is again an unbiased estimator with $\mathbb{E} g = \nabla l(w)$.

Convergence rates in stochastic optimization depend on the variability of the gradient estimator, typically either via the expected squared norm (ESN) $\mathbb{E}\|g\|_2^2$ or the trace of the variance $\operatorname{tr} \mathbb{V} g$. These are closely related, since $\mathbb{E}\|g\|_2^2 = \operatorname{tr} \mathbb{V} g + \|\mathbb{E} g\|_2^2$.

The goal of this paper is to bound the variability of $g$ for reparameterization / path estimators. This requires making assumptions about (i) the transformation function $T_w$ and base distribution $s$ (which determine $q_w$) and (ii) the target function $f$. Here, we are interested in the case of affine mappings. We use the mapping [17]

$$T_w(u) = Cu + m,$$

where $w = (m, C)$ is a single vector of all parameters. This is the most common mapping used to represent location-scale families. That is, if $u \sim s$ then $T_w(u)$ is equal in distribution to a location-scale family distribution. For example, if $s = \mathcal{N}(0, I)$ then $T_w(u)$ is equal in distribution to $\mathcal{N}(m, CC^\top)$.

We will refer to the base distribution as standardized if the components of $u = (u_1, \dots, u_d) \sim s$ are iid with $\mathbb{E} u_1 = \mathbb{E} u_1^3 = 0$ and $\mathbb{V} u_1 = 1$. The bounds will depend on the fourth moment $\kappa = \mathbb{E} u_1^4$, but are otherwise independent of $s$.
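
Concretely, for the affine mapping the reparameterization gradient needs nothing beyond $\nabla f$: by the chain rule, the gradient of $f(Cu + m)$ with respect to $m$ is $\nabla f(z)$ and with respect to $C$ it is $\nabla f(z) u^\top$, evaluated at $z = Cu + m$. The sketch below is an illustration of this (not code from the paper), using a toy quadratic target and a Gaussian base distribution; the names are ours.

```python
import numpy as np

def grad_f(z, z_bar, M=2.0):
    # Gradient of the toy M-smooth target f(z) = (M/2) * ||z - z_bar||^2.
    return M * (z - z_bar)

def rp_gradient(m, C, u, grad_f_z):
    # Reparameterization gradient of f(T_w(u)) for T_w(u) = C u + m.
    # Chain rule: gradient w.r.t. m is grad f(z); w.r.t. C it is grad f(z) u^T.
    z = C @ u + m
    g = grad_f_z(z)
    return g, np.outer(g, u)

d = 3
rng = np.random.default_rng(0)
m, C = rng.normal(size=d), np.tril(rng.normal(size=(d, d)))
z_bar = np.zeros(d)

# Averaging many single-sample estimates approximates the true gradient of
# l(w) = E_{u~s} f(T_w(u)) for a standardized Gaussian base distribution s.
grads = [rp_gradient(m, C, rng.normal(size=d), lambda z: grad_f(z, z_bar))
         for _ in range(50_000)]
grad_m = np.mean([gm for gm, _ in grads], axis=0)   # approaches M * (m - z_bar)
grad_C = np.mean([gC for _, gC in grads], axis=0)   # approaches M * C
print(grad_m, 2.0 * (m - z_bar))
print(np.round(grad_C - 2.0 * C, 2))
```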

To apply these estimators to VI, choose $f(z) = \log p(z, x)$. Then $\mathrm{ELBO}(w) = l(w) + h(w)$, where $h$ is the entropy of $q_w$. Stochastic estimates of the gradient of $l$ can be employed in a stochastic gradient method to maximize the ELBO. To model the stochastic setting, suppose that $X = (x_1, \dots, x_N)$ are iid and $p(z, X) = p(z) \prod_{n=1}^N p(x_n | z)$. Then, one may choose, e.g., $f_n(z) = \frac{1}{N} \log p(z) + \log p(x_n | z)$. The entropy $h$ is related to the (constant) entropy of the base distribution as $h(w) = \mathrm{Entropy}(s) + \log |C|$.

The main bounds of this paper concern estimators for the gradient of $l(w)$ alone, disregarding $h(w)$. There are two reasons for this. First, in location-scale families, the exact gradient of $h(w)$ is known. Second, if one uses a stochastic estimator for $h(w)$, this can be "absorbed" into $l(w)$ to some degree. This is discussed further in Sec. 5.
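
For example, when $s = \mathcal{N}(0, I)$, so that $q_w = \mathcal{N}(m, CC^\top)$, the entropy is $h(w) = \frac{d}{2}\log(2\pi e) + \log|\det C|$ and its exact gradient with respect to $C$ is $C^{-\top}$ (it does not depend on $m$). A minimal sketch of this closed form, assuming an invertible $C$ (our illustration, not code from the paper):

```python
import numpy as np

def entropy_and_grad(C):
    # Entropy of N(m, C C^T): h(w) = (d/2) log(2*pi*e) + log|det C|.
    # The entropy does not depend on m; the gradient of log|det C| w.r.t. C is C^{-T}.
    d = C.shape[0]
    _, logabsdet = np.linalg.slogdet(C)
    h = 0.5 * d * np.log(2.0 * np.pi * np.e) + logabsdet
    return h, np.linalg.inv(C).T

C = np.array([[2.0, 0.0], [0.5, 1.5]])
h, dh_dC = entropy_and_grad(C)
print(h, dh_dC)
```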

3 Variance Bounds

3.1 Technical Lemmas

We begin with two technical lemmas which will do most of the work in the main results. Both have (somewhat laborious) proofs in Sec. 7 (Appendix). The first lemma relates the norm of the parameter gradient of $f(T_w(u))$ (with respect to $w$) to the norm of the gradient of $f(z)$ itself, evaluated at $z = T_w(u)$.

Lemma 1. For any $w$ and $u$,

$$\|\nabla_w f(T_w(u))\|_2^2 = \|\nabla f(T_w(u))\|_2^2 \left(1 + \|u\|_2^2\right).$$

The proof is tedious but essentially amounts to calculating the derivative with respect to each component of $w$ (i.e. entries $m_i$ and $C_{ij}$), summing the square of all entries, and simplifying.
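
The identity is also easy to check numerically (an illustration, not part of the paper): with the chain-rule gradients for the affine mapping, $\nabla_m f(Cu+m) = \nabla f(z)$ and $\nabla_C f(Cu+m) = \nabla f(z) u^\top$, the squared norms match exactly for any $w$ and $u$; here $f(z) = \frac{3}{2}\|z - 1\|_2^2$ is used as a stand-in target.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4
m, C, u = rng.normal(size=d), rng.normal(size=(d, d)), rng.normal(size=d)

z = C @ u + m
grad_z = 3.0 * (z - 1.0)                       # gradient of f(z) = (3/2)||z - 1||^2 at z
grad_m, grad_C = grad_z, np.outer(grad_z, u)   # chain rule for T_w(u) = C u + m

lhs = np.sum(grad_m**2) + np.sum(grad_C**2)     # ||grad_w f(T_w(u))||^2
rhs = np.sum(grad_z**2) * (1.0 + np.sum(u**2))  # ||grad f(T_w(u))||^2 (1 + ||u||^2)
print(np.isclose(lhs, rhs))                     # True
```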

The second lemma gives a closed form for the expectation of a closely related expression that will appear in the proof of Thm. 3 as a consequence of applying Lem. 1.

Lemma 2. Let $u \sim s$ for $s$ standardized with $u \in \mathbb{R}^d$ and $\mathbb{E}_{u \sim s} u_i^4 = \kappa$. Then for any $\bar{z}$,

$$\mathbb{E} \,\|T_w(u) - \bar{z}\|_2^2 \left(1 + \|u\|_2^2\right) = (d + 1)\|m - \bar{z}\|_2^2 + (d + \kappa)\|C\|_F^2.$$

Again, the proof is tedious but based on simple ideas: substitute the definition of $T_w$ into the left-hand side and expand all terms. This gives terms between zeroth and fourth order (in $u$). Calculating the exact expectation of each and simplifying using the assumption that $s$ is standardized gives the result.

3.2 Basic Variance Bound

Given these two lemmas, we give our major technical result, bounding the variability of a reparameterization-based gradient estimator. This will later be extended to consider data subsampling and a generalized notion of smoothness. Note that we do not require that $f$ be convex.

Theorem 3. Suppose $f$ is $M$-smooth, $\bar{z}$ is a stationary point of $f$, and $s$ is standardized with $u \in \mathbb{R}^d$ and $\mathbb{E} u_i^4 = \kappa$. Let $g = \nabla_w f(T_w(u))$ for $u \sim s$. Then,

$$\mathbb{E}\|g\|_2^2 \le M^2 \left( (d + 1)\|m - \bar{z}\|_2^2 + (d + \kappa)\|C\|_F^2 \right). \quad (2)$$

Moreover, this result is unimprovable without further assumptions.

Proof. We expand the definition of $g$, and use the above lemmas and the smoothness of $f$:

$$\begin{aligned}
\mathbb{E}\|g\|_2^2 &= \mathbb{E}\|\nabla_w f(T_w(u))\|_2^2 && \text{(definition of } g\text{)} \\
&= \mathbb{E}\|\nabla f(T_w(u))\|_2^2 (1 + \|u\|_2^2) && \text{(Lem. 1)} \\
&= \mathbb{E}\|\nabla f(T_w(u)) - \nabla f(\bar{z})\|_2^2 (1 + \|u\|_2^2) && (\nabla f(\bar{z}) = 0) \\
&\le \mathbb{E}\, M^2 \|T_w(u) - \bar{z}\|_2^2 (1 + \|u\|_2^2) && (f \text{ is smooth}) \\
&= M^2 \left( (d + 1)\|m - \bar{z}\|_2^2 + (d + \kappa)\|C\|_F^2 \right). && \text{(Lem. 2)}
\end{aligned}$$

To see that this is unimprovable without further assumptions, observe that the only inequality is using the smoothness of $f$ to bound the norm of the difference of gradients at $T_w(u)$ and at $\bar{z}$. But for $f(z) = \frac{M}{2}\|z - \bar{z}\|_2^2$ this inequality is tight. Thus, for any $M$ and $\bar{z}$, there is a function $f$ satisfying the assumptions of the theorem such that Eq. (2) is an equality.
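
The tightness claim is easy to see numerically (an illustration, not from the paper): for the quadratic $f(z) = \frac{M}{2}\|z - \bar{z}\|_2^2$ and a Gaussian base distribution ($\kappa = 3$), a Monte Carlo estimate of $\mathbb{E}\|g\|_2^2$ matches the right-hand side of Eq. (2) up to sampling error.

```python
import numpy as np

rng = np.random.default_rng(3)
d, M, kappa = 3, 2.0, 3.0
m, C = rng.normal(size=d), rng.normal(size=(d, d))
z_bar = rng.normal(size=d)

u = rng.normal(size=(200_000, d))     # samples from the standardized Gaussian base
z = u @ C.T + m                       # rows are T_w(u) = C u + m
grad_z = M * (z - z_bar)              # grad f at each sample
# Lemma 1: ||grad_w f(T_w(u))||^2 = ||grad f(z)||^2 (1 + ||u||^2)
esn = np.mean(np.sum(grad_z**2, axis=1) * (1 + np.sum(u**2, axis=1)))

bound = M**2 * ((d + 1) * np.sum((m - z_bar)**2) + (d + kappa) * np.sum(C**2))
print(esn, bound)                     # approximately equal for this quadratic f
```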

With a small amount of additional looseness, we can cast Eq. (2) into a more intuitive form. Define $\bar{w} = (\bar{z}, 0_{d \times d})$, where $0_{d \times d}$ is a $d \times d$ matrix of zeros. Then $\|w - \bar{w}\|_2^2 = \|m - \bar{z}\|_2^2 + \|C\|_F^2$, so we can slightly relax Eq. (2) to the more user-friendly form of

$$\mathbb{E}\|g\|_2^2 \le (d + \kappa)\, M^2 \|w - \bar{w}\|_2^2. \quad (3)$$

The only additional looseness is bounding $d + 1 \le d + \kappa$. This is justified since when $s$ is standardized, $\kappa = \mathbb{E} u_i^4$ is the kurtosis, which is at least one. Here, $\kappa$ is determined by $s$ and does not depend on the dimensionality. For example, if $s$ is Gaussian, $\kappa = 3$. Thus, Eq. (3) will typically not be much looser than Eq. (2).

Intuitively, $\bar{w}$ are parameters that concentrate $q$ entirely at a stationary point of $f$. It is not hard to show that $\|w - \bar{w}\|_2^2 = \mathbb{E}_{z \sim q_w} \|z - \bar{z}\|_2^2$ (a short derivation is given below). Thus, Eq. (3) intuitively says that $\mathbb{E}\|g\|^2$ is bounded in terms of how far the average point sampled from $q_w$ is from $\bar{z}$. Since $f$ need not be convex, there might be multiple stationary points. In this case, Thm. 3 holds simultaneously for all of them.
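
The identity follows from a line of algebra, writing $z = Cu + m$ and using that a standardized $u$ has zero mean and identity covariance (so the cross term vanishes):

$$\mathbb{E}_{z \sim q_w}\|z - \bar{z}\|_2^2
= \mathbb{E}_{u \sim s}\|Cu + (m - \bar{z})\|_2^2
= \mathbb{E}\,\mathrm{tr}\!\left(C u u^\top C^\top\right) + \|m - \bar{z}\|_2^2
= \|C\|_F^2 + \|m - \bar{z}\|_2^2
= \|w - \bar{w}\|_2^2.$$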

3.3 Generalized Smoothness

Since the above bound is not improvable, tightening it requires stronger assumptions. The tightness of Thm. 3 is determined by the smoothness condition that the difference of gradients at two points is bounded as $\|\nabla f(y) - \nabla f(z)\|_2 \le M\|y - z\|_2$. For some problems, $f$ may be much smoother in certain directions than others. In such cases, the smoothness constant $M$ will need to reflect the worst-case direction. To produce a tighter bound for such situations, we generalize the notion of smoothness to allow $M$ to be a symmetric matrix.

Definition 4. $f$ is $M$-matrix-smooth if $\|\nabla f(y) - \nabla f(z)\|_2 \le \|M(y - z)\|_2$ (for symmetric $M$).

We can generalize the result in Thm. 3 to functions with this matrix-smoothness condition. The proof is very similar. The main difference is that after applying the smoothness condition, the matrix $M$ needs to be "absorbed" into the parameters $w = (m, C)$ before applying Lem. 2.

Theorem 5. Suppose $f$ is $M$-matrix-smooth, $\bar{z}$ is a stationary point of $f$, and $s$ is standardized with $u \in \mathbb{R}^d$ and $\mathbb{E} u_i^4 = \kappa$. Let $g = \nabla_w f(T_w(u))$ for $u \sim s$. Then,

$$\mathbb{E}\|g\|_2^2 \le (d + 1)\|M(m - \bar{z})\|_2^2 + (d + \kappa)\|MC\|_F^2. \quad (4)$$

Moreover, this result is unimprovable without further assumptions.

Proof. The proof closely mirrors that of Thm. 3. Here, given $w = (m, C)$, we define $v = (Mm, MC)$ to be $w$ with $M$ "absorbed" into the parameters:

$$\begin{aligned}
\mathbb{E}\|g\|_2^2 &= \mathbb{E}\|\nabla_w f(T_w(u))\|_2^2 && \text{(definition of } g\text{)} \\
&= \mathbb{E}\|\nabla f(T_w(u))\|_2^2 (1 + \|u\|_2^2) && \text{(Lem. 1)} \\
&= \mathbb{E}\|\nabla f(T_w(u)) - \nabla f(\bar{z})\|_2^2 (1 + \|u\|_2^2) && (\nabla f(\bar{z}) = 0) \\
&\le \mathbb{E}\|M(T_w(u) - \bar{z})\|_2^2 (1 + \|u\|_2^2) && (f \text{ is matrix-smooth}) \\
&= \mathbb{E}\|T_v(u) - M\bar{z}\|_2^2 (1 + \|u\|_2^2) && \text{(absorb } M \text{ into } v\text{)} \\
&= (d + 1)\|Mm - M\bar{z}\|_2^2 + (d + \kappa)\|MC\|_F^2. && \text{(Lem. 2)}
\end{aligned}$$

To see that this is unimprovable, observe that the only inequality is the matrix-smoothness condition on $f$. But for $f(z) = \frac{1}{2}(z - \bar{z})^\top M (z - \bar{z})$, the difference of gradients $\|\nabla f(y) - \nabla f(z)\|_2 = \|M(y - z)\|_2$ is an equality. Thus, for any $M$ and $\bar{z}$, there is an $f$ satisfying the assumptions of the theorem such that the bound in Eq. (4) is an equality.

It is easy to see that this reduces to Thm. 3 in the case that $f$ is smooth in the standard sense: this corresponds to the situation where $M$ is some constant times the identity. Alternatively, one can simply observe that the two results are the same if $M$ is a scalar. Thus, going forward we will use Eq. (4) to represent the result with either type of smoothness assumption on $f$.
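
To see what the matrix notion buys, the following illustration (with hypothetical numbers, not from the paper) evaluates both bounds for an anisotropic quadratic $f(z) = \frac{1}{2}(z - \bar{z})^\top M (z - \bar{z})$: the scalar bound must use the worst-case constant $\|M\|_2$, while Eq. (4) keeps the directional information and is never larger.

```python
import numpy as np

rng = np.random.default_rng(4)
d, kappa = 3, 3.0
m, C = rng.normal(size=d), 0.1 * rng.normal(size=(d, d))
z_bar = np.zeros(d)

# A symmetric smoothness matrix that is much "stiffer" in one direction.
M = np.diag([100.0, 1.0, 1.0])

# Scalar bound: Eq. (2) with the worst-case constant ||M||_2.
M_scalar = np.linalg.norm(M, 2)
bound_scalar = M_scalar**2 * ((d + 1) * np.sum((m - z_bar)**2)
                              + (d + kappa) * np.sum(C**2))

# Matrix bound: Eq. (4), keeping the direction-dependent smoothness.
bound_matrix = ((d + 1) * np.sum((M @ (m - z_bar))**2)
                + (d + kappa) * np.sum((M @ C)**2))

print(bound_scalar, bound_matrix)   # the matrix bound is never larger
```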

3.4 Subsampling

Often, the function $f(z)$ takes the form of a sum over other functions $f_n(z)$, typically representing different data. Write this as

$$f(z) = \sum_{n=1}^N f_n(z).$$

To estimate the gradient of $\mathbb{E}_{u \sim s} f(T_w(u))$, one can save time by using "subsampling": that is, draw a random $n \sim \pi$, and then estimate the gradient of $\mathbb{E}_{u \sim s} f_n(T_w(u))$. The following result bounds this procedure. It essentially just takes a set of estimators, one corresponding to each function $f_n$, bounds their expected squared norm using the previous theorems, and then combines these.

Theorem 6. Suppose $f_n$ is $M_n$-matrix-smooth, $\bar{z}_n$ is a stationary point of $f_n$, and $s$ is standardized with $u \in \mathbb{R}^d$ and $\mathbb{E} u_i^4 = \kappa$. Let $g = \frac{1}{\pi(n)} \nabla_w f_n(T_w(u))$ for $u \sim s$ and $n \sim \pi$ independent. Then,

$$\mathbb{E}\|g\|_2^2 \le \sum_{n=1}^N \frac{1}{\pi(n)} \left( (d + 1)\|M_n(m - \bar{z}_n)\|_2^2 + (d + \kappa)\|M_n C\|_F^2 \right). \quad (5)$$

Moreover, this result is unimprovable without further assumptions.

Proof. Consider a simple lemma: suppose $a_1, \dots, a_N$ are independent random vectors and $\pi$ is any distribution over $\{1, \dots, N\}$. Let $b = a_n / \pi(n)$ for $n \sim \pi$, where $n$ is independent of $a_n$. It is easy to show that $\mathbb{E} b = \sum_{n=1}^N \mathbb{E} a_n$ and $\mathbb{E}\|b\|_2^2 = \sum_n \mathbb{E}\|a_n\|_2^2 / \pi(n)$. The result follows from applying this with $a_n = \nabla_w f_n(T_w(u))$, and then bounding $\mathbb{E}\|a_n\|_2^2$ using Thm. 5.

Again, in this result the only source of looseness is the use of the smoothness bound for the component functions $f_n$. Accordingly, the result can be shown to be unimprovable: for any set of stationary points $\bar{z}_n$ and smoothness parameters $M_n$ we can construct functions $f_n$ (as in Thm. 5) for which the previous theorems are tight and thus this result is also tight.

This result generalizes all the previous bounds: Thm. 5 is the special case when $N = 1$, while Thm. 3 is the special case when $N = 1$ and $f_1$ is $M_1$-smooth (for a scalar $M_1$). The case where $N > 1$ but $f_n$ is $M_n$-smooth (for scalar $M_n$) is also useful: the bound in Eq. (5) remains valid, but with a scalar $M_n$.
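
As an illustration of why the sampling distribution matters (our sketch with made-up $M_n$ and $\bar{z}_n$, not code from the paper): the right-hand side of Eq. (5) has the form $\sum_n b_n / \pi(n)$ with per-term bounds $b_n$, and minimizing it over $\pi$ subject to $\sum_n \pi(n) = 1$ gives $\pi(n) \propto \sqrt{b_n}$, one simple non-uniform choice that tightens the bound relative to $\pi(n) = 1/N$.

```python
import numpy as np

def per_term_bounds(m, C, M_list, z_bar_list, kappa):
    # b_n = (d+1)||M_n (m - z_bar_n)||^2 + (d+kappa)||M_n C||_F^2, as in Eq. (5).
    d = len(m)
    return np.array([(d + 1) * np.sum((Mn @ (m - zn))**2)
                     + (d + kappa) * np.sum((Mn @ C)**2)
                     for Mn, zn in zip(M_list, z_bar_list)])

def subsampling_bound(b, pi):
    # Right-hand side of Eq. (5) for a sampling distribution pi.
    return np.sum(b / pi)

rng = np.random.default_rng(5)
d, N, kappa = 3, 50, 3.0
m, C = rng.normal(size=d), 0.2 * rng.normal(size=(d, d))
M_list = [np.diag(rng.uniform(0.1, 10.0, size=d)) for _ in range(N)]
z_bar_list = [rng.normal(size=d) for _ in range(N)]

b = per_term_bounds(m, C, M_list, z_bar_list, kappa)
pi_uniform = np.full(N, 1.0 / N)
pi_opt = np.sqrt(b) / np.sum(np.sqrt(b))   # minimizes sum_n b_n / pi(n)

print(subsampling_bound(b, pi_uniform), subsampling_bound(b, pi_opt))
```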

4 Empirical Evaluation

4.1 Model and Datasets

We consider Bayesian linear regression and logistic regression models on various datasets (Table 1). Given data $(x_1, y_1), \dots, (x_N, y_N)$, let $y$ be a vector of all $y_n$ and $X$ a matrix of all $x_n$. We assume a Gaussian prior so that $p(z, y | X) = \mathcal{N}(z | 0, \sigma^2 I) \prod_{n=1}^N p(y_n | z, x_n)$. For linear regression, $p(y_n | z, x_n) = \mathcal{N}(y_n | z^\top x_n, \lambda^2)$, while for logistic regression, $p(y_n | z, x_n) = \mathrm{Sigmoid}(y_n z^\top x_n)$. For both models we use a prior with $\sigma^2 = 1$. For linear regression, we set $\lambda^2 = 4$.

To justify the use of VI, apply the decomposition in Eq. (1), substituting $p(z, y | X)$ in place of $p(z, x)$, to get that

$$\log p(y | X) = \mathbb{E}_{z \sim q} \log \frac{p(z, y | X)}{q(z)} + \mathrm{KL}(q(z) \,\|\, p(z | y, X)).$$

Thus, adjusting the parameters of $q$ to maximize the first term on the right tightens a lower bound on the conditional log likelihood $\log p(y | X)$ and minimizes the divergence from $q$ to the posterior. So, we again take our goal as maximizing $l(w) + h(w)$. In the batch setting, $f(z) = \log p(z, y | X)$, while with subsampling, $f_n(z) = \frac{1}{N} \log p(z) + \log p(y_n | z, x_n)$.
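
For concreteness, a minimal sketch (not the authors' code) of the subsampled target $f_n$ and its gradient for the logistic model, which is all the reparameterization estimator of Sec. 2 needs; the variable names and the example draw are ours.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def f_n(z, x_n, y_n, N, sigma2=1.0):
    # f_n(z) = (1/N) log p(z) + log p(y_n | z, x_n) for Bayesian logistic regression,
    # dropping the additive normalizing constant of the Gaussian prior.
    log_prior = -0.5 * np.dot(z, z) / sigma2
    log_lik = -np.logaddexp(0.0, -y_n * np.dot(z, x_n))   # log Sigmoid(y_n z'x_n)
    return log_prior / N + log_lik

def grad_f_n(z, x_n, y_n, N, sigma2=1.0):
    # Gradient of f_n with respect to z.
    return -z / (N * sigma2) + (1.0 - sigmoid(y_n * np.dot(z, x_n))) * y_n * x_n

# One draw of the subsampled reparameterization estimator with uniform pi(n) = 1/N:
# g = (1/pi(n)) * grad_w f_n(T_w(u)), using the chain rule for T_w(u) = C u + m.
rng = np.random.default_rng(6)
d, N = 5, 1000
x_n, y_n = rng.normal(size=d), 1.0
m, C = np.zeros(d), np.eye(d)
u = rng.normal(size=d)
gz = grad_f_n(C @ u + m, x_n, y_n, N)
grad_m, grad_C = N * gz, N * np.outer(gz, u)   # 1/pi(n) = N for uniform sampling
```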

Figure 1: How loose are the bounds compared to reality? Odd rows: evolution of the ELBO during the single optimization trace used to compare all estimators. Even rows: true and bounded variance with gradients estimated in "batch" (using the full dataset in each evaluation) and "uniform" (stochastically with $\pi(n) = 1/N$). The first two rows are for linear regression models, while the rest are for logistic regression. Key observations: (i) batch estimation is lower-variance but higher cost; (ii) variance with stochastic estimation varies little over time; (iii) using matrix smoothness significantly tightens bounds and is exact for linear regression models.

Sec. 8 shows that if $0 \ge \phi''(t) \ge -c$, then $\sum_{n=1}^N \phi(a_n^\top z + b_n)$ is $M$-matrix-smooth for $M = c \sum_{i=1}^N a_i a_i^\top$. Applying this¹ gives that $f(z)$ and $f_n(z)$ are matrix-smooth for

$$M = \frac{1}{\sigma^2} I + c \sum_{n=1}^N x_n x_n^\top, \qquad M_n = \frac{1}{N \sigma^2} I + c\, x_n x_n^\top,$$

where $c = 1/\lambda^2$ for linear regression and $c = 1/4$ for logistic regression. Taking the spectral norm of these matrices gives scalar smoothness constants. With subsampling, this is $\|M_n\|_2 = \frac{1}{\sigma^2 N} + c\|x_n\|^2$.

¹ For linear regression, set $\phi(t) = -t^2/(2\lambda^2)$, $a_n = x_n$ and $b_n = -y_n$, and observe that $\phi'' = -1/\lambda^2$. For logistic regression, set $\phi(t) = \log \mathrm{Sigmoid}(t)$, $a_n = y_n x_n$ and $b_n = 0$, and observe that $\phi'' \ge -1/4$. Adding the prior and using the triangle inequality gives the result.
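
These constants are simple to compute; the sketch below (our illustration, with a random stand-in design matrix) forms $M$, $M_n$, and the scalar constants $\|M_n\|_2$ for the logistic case ($c = 1/4$) and checks the closed form for the spectral norm.

```python
import numpy as np

def smoothness_constants(X, sigma2=1.0, c=0.25):
    # M   = (1/sigma2) I + c * sum_n x_n x_n^T
    # M_n = (1/(N sigma2)) I + c * x_n x_n^T,  with ||M_n||_2 = 1/(sigma2 N) + c ||x_n||^2.
    # Use c = 1/lambda^2 for linear regression and c = 1/4 for logistic regression.
    N, d = X.shape
    M = np.eye(d) / sigma2 + c * (X.T @ X)
    M_n = [np.eye(d) / (N * sigma2) + c * np.outer(x, x) for x in X]
    scalar_n = 1.0 / (sigma2 * N) + c * np.sum(X**2, axis=1)
    return M, M_n, scalar_n

rng = np.random.default_rng(7)
X = rng.normal(size=(100, 5))          # hypothetical design matrix with rows x_n^T
M, M_n, scalar_n = smoothness_constants(X)
# The scalar constant equals the spectral norm of the identity-plus-rank-one M_n.
print(np.allclose(scalar_n[0], np.linalg.norm(M_n[0], 2)))
```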

Figure 2: Tightening variance bounds reduces true variance. A comparison of the true (vertical bars) and bounded $\mathbb{E}\|g\|^2$ values produced using five different gradient estimators. Batch does not use subsampling. Uniform uses subsampling with $\pi(n) = 1/N$, proportional uses $\pi(n) \propto M_n$ (the scalar smoothness constants), opt (scalar) numerically optimizes $\pi(n)$ to tighten Eq. (5) with a scalar $M_n$, and opt (matrix) tightens Eq. (5) with a matrix $M_n$. For each sampling strategy, we show the variance bound both with a scalar and a matrix $M_n$. Uniform sampling has true and bounded values of $\mathbb{E}\|g\|^2$ ranging between 1.5x and 10x higher than those for sampling with $\pi$ numerically optimized.

Table 1: Regression (r) and classification (c) datasets.

Dataset      Type   # data   # dims
boston       r       506      13
fires        r       517      12
cpusmall     r      8192      13
a1a          c      1695     124
ionosphere   c       351      35
australian   c       690      15
sonar        c       208      61
mushrooms    c      8124     113

4.2 Evaluation of Bounds

To enable a clear comparison of different estimators and bounds, we generate a single optimization trace of parameter vectors $w$ for each dataset. All comparisons use this same trace. These use a
