A New Distribution on the Simplex with Auto-Encoding Applications

Andrew Stirn, Tony Jebara, David A. Knowles
Department of Computer Science, Columbia University, New York, NY 10027
Author footnotes: jointly affiliated with the New York Genome Center; jointly affiliated with Spotify Technology S.A.; jointly affiliated with Columbia University's Data Science Institute and the New York Genome Center.
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

We construct a new distribution for the simplex using the Kumaraswamy distribution and an ordered stick-breaking process. We explore and develop the theoretical properties of this new distribution and prove that it exhibits symmetry (exchangeability) under the same conditions as the well-known Dirichlet. Like the Dirichlet, the new distribution is adept at capturing sparsity but, unlike the Dirichlet, has an exact and closed-form reparameterization, making it well suited for deep variational Bayesian modeling. We demonstrate the distribution's utility in a variety of semi-supervised auto-encoding tasks. In all cases, the resulting models achieve competitive performance commensurate with their simplicity, use of explicit probability models, and abstinence from adversarial training.

1 Introduction

The Variational Auto-Encoder (VAE) [12] is a computationally efficient approach for performing variational inference [11, 27] since it avoids per-data-point variational parameters through the use of an inference network with shared global parameters.
For models where stochastic gradient variational Bayes requires Monte Carlo estimates in lieu of closed-form expectations, [23, 12] note that low-variance estimators can be calculated from gradients of samples with respect to the variational parameters that describe their generating densities. In the case of the normal distribution, such gradients are straightforward to obtain via an explicit, tractable reparameterization, which is often referred to as the "reparameterization trick". Unfortunately, most distributions do not admit such a convenient reparameterization. Computing low-bias and low-variance stochastic gradients is an active area of research, with a detailed breakdown of current methods presented in [4].
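To make this concrete, the short sketch below (our illustration, not from the paper; it assumes NumPy) contrasts the pathwise (reparameterized) estimator with the score-function estimator for the toy objective $\mathbb{E}_{z \sim \mathcal{N}(\mu, \sigma^2)}[z^2]$, whose true gradient with respect to $\mu$ is $2\mu$; the pathwise estimator typically has far lower variance.

```python
# Illustrative sketch (not from the paper): compare gradient-estimator variances
# for d/d(mu) E[z^2], z ~ N(mu, sigma^2); the true gradient is 2 * mu.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n = 1.5, 1.0, 100_000

eps = rng.standard_normal(n)
z = mu + sigma * eps                  # reparameterized samples

pathwise = 2.0 * z                    # d(z^2)/d(mu) with eps held fixed
score = z**2 * (z - mu) / sigma**2    # z^2 * d/d(mu) log N(z; mu, sigma^2)

print("true gradient       :", 2 * mu)
print("pathwise  mean / var:", pathwise.mean(), pathwise.var())
print("score-fn  mean / var:", score.mean(), score.var())
```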
Of particular interest in Bayesian modeling is the well-known Dirichlet distribution, which often serves as a conjugate prior for latent categorical variables. Perhaps the most desirable property of a Dirichlet prior is its ability to induce sparsity by concentrating mass towards the corners of the simplex. In this work, we develop a surrogate distribution for the Dirichlet that offers explicit, tractable reparameterization, the ability to capture sparsity, and barycentric symmetry (exchangeability) properties equivalent to the Dirichlet.

Generative processes can be used to infer missing class labels in semi-supervised learning. The first VAE-based method that used deep generative models for semi-supervised learning derived two variational objectives for the same generative process, one for when labels are observed and one for when labels are latent, that are jointly optimized [13]. As they note, however, the variational distribution over class labels appears only in the objective for unlabeled data. Its absence from the labeled-data objective, as they point out, results from their lack of a (Dirichlet) prior on the (latent) labels. We suspect they neglected to specify this prior because, at the time, it would have rendered inference intractable.
They ameliorate this shortcoming by introducing a discriminative third objective, the cross-entropy of the variational distribution over class labels, which they compute over the labeled data. They then jointly optimize the two variational objectives after adding a scaled version of the cross-entropy term.

Our work builds on [13], while offering some key improvements. First, we remove the need for an additional discriminative loss through our use of a Dirichlet prior. We overcome intractability by using our proposed distribution as an approximation for the Dirichlet posterior. Naturally, our generative process is slightly different, but it allows us to consider only unmodified variational objectives. Second, we do not stack models together. Kingma et al.'s best results utilized a standard VAE (M1) to learn a latent space upon which their semi-supervised VAE (M2) was fit. For SVHN data, they perform dimensionality reduction with PCA prior to fitting M1. We abandon the stacked-model approach in favor of training a single model with more expressive recognition and generative networks. Also, we use minimal preprocessing (rescaling pixel intensities to [0, 1]).

Use of the Kumaraswamy distribution [14] by the machine learning community has only occurred in the last few years. It has been used to fit Gaussian mixture models, for which a Dirichlet prior is part of the generative process, with VAEs [19]. To sample mixture weights from the variational posterior, they recognize that they can decompose a Dirichlet into its stick-breaking Beta distributions and approximate them with the Kumaraswamy distribution. We too employ the same stick-breaking decomposition coupled with Kumaraswamy approximations. However, we improve on this technique by expounding and resolving the order dependence their approximation incurs. As we detail in section 2, using the Kumaraswamy for stick-breaking is not order agnostic (exchangeable); the generated variable has a density that depends on ordering. We leverage the observation that one can permute a Dirichlet's parameters, perform the stick-breaking sampling procedure with Beta distributions, and undo the permutation on the sampled variable without affecting its density. Those same authors also use this Beta-Kumaraswamy stick-breaking approximation to fit a Bayesian non-parametric model with a VAE [20]. Here too, they do not account for the impact ordering has on their approximation. Their latent space, being non-parametric, grows in dimensions when it insufficiently represents the data. As we demonstrate in section 2.2 and fig. 1, approximating sparse Dirichlet samples with the Kumaraswamy stick-breaking decomposition without accounting for the ordering dependence produces a large bias in the sample's last dimension. We conjecture that their Bayesian non-parametric model would utilize fewer dimensions with our proposed distribution; verifying this would be an interesting follow-up to our work.

Figure 1: Sampling bias for a 5-dimensional sparsity-inducing Dirichlet approximation using $\alpha = \frac{1}{5}(1,1,1,1,1)$. We maintain histograms for each sample dimension for three methods: Dirichlet, Kumaraswamy stick-breaks with a fixed order, and Kumaraswamy stick-breaks with a random ordering. Note the bias on the last dimension when using a fixed order. Randomizing order eliminates this bias.
2 A New Distribution on the Simplex

The stick-breaking process is a sampling procedure used to generate a $K$-dimensional random variable in the $(K-1)$-simplex. The process requires sampling from $K-1$ (often out of $K$) distributions, each with support over $[0,1]$. Let $p_i(v; a_i, b_i)$ be some distribution for $v \in [0,1]$ parameterized by $a_i$ and $b_i$. Let $o$ be some ordering (permutation) of $\{1,\ldots,K\}$. Then, algorithm 1 captures a generalized stick-breaking process. The necessity for incorporating ordering will become clear in section 2.1.

Algorithm 1 A Generalized Stick-Breaking Process
Require: $K \geq 2$, base distributions $p_i(v; a_i, b_i)\ \forall i \in \{1,\ldots,K\}$, and some ordering $o$
  Sample: $v_{o_1} \sim p_{o_1}(v; a_{o_1}, b_{o_1})$
  Assign: $x_{o_1} \leftarrow v_{o_1}$, $i \leftarrow 2$
  while $i < K$ do
    Sample: $v_{o_i} \sim p_{o_i}(v; a_{o_i}, b_{o_i})$
    Assign: $x_{o_i} \leftarrow v_{o_i}\big(1 - \sum_{j<i} x_{o_j}\big)$, $i \leftarrow i + 1$
  end while
  Assign: $x_{o_K} \leftarrow 1 - \sum_{j<K} x_{o_j}$
  return $x$
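One convenient reading of algorithm 1, for the Dirichlet-style parameterization used in section 2.1, is to permute the parameters, break the stick left to right, and then undo the permutation on the sample, which is the observation quoted in the introduction. The sketch below (ours, assuming NumPy; `stick_break` and `draw_stick` are illustrative names) follows that reading.

```python
# Illustrative sketch of algorithm 1: permute the parameters, stick-break left to
# right, then undo the permutation. `draw_stick(a, b)` is any sampler on [0, 1].
import numpy as np

def stick_break(draw_stick, alpha, ordering):
    """Generalized stick-breaking: returns x on the (K-1)-simplex."""
    alpha = np.asarray(alpha, dtype=float)
    o = np.asarray(ordering)
    a = alpha[o]                               # permuted parameters
    x_perm = np.zeros(len(a))
    remaining = 1.0                            # stick length not yet assigned
    for i in range(len(a) - 1):
        v = draw_stick(a[i], a[i + 1:].sum())  # v ~ p(v; a_i, sum of the rest)
        x_perm[i] = v * remaining
        remaining -= x_perm[i]
    x_perm[-1] = remaining                     # the last coordinate absorbs the rest
    x = np.zeros_like(x_perm)
    x[o] = x_perm                              # undo the permutation
    return x

# With Beta sticks this reproduces a Dirichlet(alpha) draw for any ordering.
rng = np.random.default_rng(0)
print(stick_break(rng.beta, alpha=[2.0, 3.0, 4.0], ordering=[2, 0, 1]))
```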
27、asuch thatT(u) Kumaraswamy(a,b). Unfortunately, this convenient reparameterization comes at a cost when we derivep(xo1,.,xoK;), which captures the density of the variable returned by algorithm 1. If, in a manner similar to generating a Dirichlet sample from Beta samples, we letpi(v;ai,bi) Kumaraswam
Unfortunately, this convenient reparameterization comes at a cost when we derive $p(x_{o_1},\ldots,x_{o_K}; \alpha)$, which captures the density of the variable returned by algorithm 1. If, in a manner similar to generating a Dirichlet sample from Beta samples, we let $p_i(v; a_i, b_i) = \mathrm{Kumaraswamy}(x; \alpha_i, \sum_{j=i+1}^{K} \alpha_j)$, then the resulting variable's density is no longer order agnostic (exchangeable). The exponential in the Kumaraswamy's $(1 - x^a)$ term that admits analytic inverse-CDF sampling can no longer cancel out $\det(J_{T_i^{-1}})$ terms as the $(1 - x)$ term in the Beta analog could. In the simplest case, the 1-simplex ($K = 2$), the possible orderings for algorithm 1 are $o \in O = \{(1,2), (2,1)\}$. Indeed, algorithm 1 returns two distinct densities according to their respective orderings:

$$f_{1,2}(x; \alpha) = \alpha_1 \alpha_2\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} \left(\frac{1 - x_1^{\alpha_1}}{1 - x_1}\right)^{\alpha_2 - 1} \qquad (3)$$

$$f_{2,1}(x; \alpha) = \alpha_1 \alpha_2\, x_1^{\alpha_1 - 1} x_2^{\alpha_2 - 1} \left(\frac{1 - x_2^{\alpha_2}}{1 - x_2}\right)^{\alpha_1 - 1}. \qquad (4)$$
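As a quick numerical check (ours, assuming NumPy), the block below evaluates eqs. (3) and (4) on a grid: each integrates to approximately one, yet they disagree pointwise, which is exactly the order dependence described above.

```python
# Sketch (ours): both f_{1,2} and f_{2,1} are valid densities on the 1-simplex,
# but they are not the same density, so the Kumaraswamy stick-break is order dependent.
import numpy as np

def f_12(x1, a1, a2):
    x2 = 1.0 - x1
    return a1 * a2 * x1**(a1 - 1) * x2**(a2 - 1) * ((1 - x1**a1) / (1 - x1))**(a2 - 1)

def f_21(x1, a1, a2):
    x2 = 1.0 - x1
    return a1 * a2 * x1**(a1 - 1) * x2**(a2 - 1) * ((1 - x2**a2) / (1 - x2))**(a1 - 1)

a1, a2 = 2.0, 3.0
n = 200_000
x1 = (np.arange(n) + 0.5) / n          # midpoint grid on (0, 1)
dx = 1.0 / n
print("integral of f_12  :", (f_12(x1, a1, a2) * dx).sum())   # ~ 1
print("integral of f_21  :", (f_21(x1, a1, a2) * dx).sum())   # ~ 1
print("max |f_12 - f_21| :", np.abs(f_12(x1, a1, a2) - f_21(x1, a1, a2)).max())
```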
In section 7.2 of the appendix, we derive $f_{1,2}$ and $f_{2,1}$ as well as the distribution for the 2-simplex, which has orderings $o \in O = \{(1,2,3), (1,3,2), (2,1,3), (2,3,1), (3,1,2), (3,2,1)\}$. For $K \geq 3$, the algebraic book-keeping gets rather involved. We thus rely on algorithm 1 to succinctly represent the complicated densities over the simplex that describe the random variables generated by a stick-breaking process using the Kumaraswamy distribution as the base (stick-breaking) distribution. Our code repository contains a symbolic implementation of algorithm 1 with the Kumaraswamy that programmatically keeps track of the algebra.

2.2 The multivariate Kumaraswamy

We posit that a good surrogate for the Dirichlet will exhibit symmetry (exchangeability) properties identical to the Dirichlet it is approximating. If our stick-breaking distribution, $p_i(v; a_i, b_i)$, cannot achieve symmetry for all values $a_i = b_i > 0$, then it is possible that the samples will exhibit bias (fig. 1). If $x \sim \mathrm{Beta}(a, b)$, then $(1 - x) \sim \mathrm{Beta}(b, a)$. It follows then that when $a = b$, $p(x) = p(1 - x)$. Unfortunately, $\mathrm{Kumaraswamy}(a, b)$ does not admit such symmetry for all $a = b > 0$. However, hope is not lost. From [6, 8], we have lemma 1.

Lemma 1. Given a function $f$ of $n$ variables, one can induce symmetry by taking the sum of $f$ over all $n!$ possible permutations of the variables.

If we define $f_o(x_{o_1},\ldots,x_{o_K}; \alpha_{o_1},\ldots,\alpha_{o_K})$ to be the joint density of the $K$-dimensional random variable returned from algorithm 1 with stick-breaking base distribution $p_i(v; a_i, b_i) = \mathrm{Kumaraswamy}(x; \alpha_i, \sum_{j=i+1}^{K} \alpha_j)$ and some ordering $o$, then our proposed distribution for the $(K-1)$-simplex is

$$\mathrm{MV\text{-}Kumaraswamy}(x; \alpha) = \mathbb{E}_{o \sim \mathrm{Uniform}(O)}\big[f_o(x_{o_1},\ldots,x_{o_K}; \alpha_{o_1},\ldots,\alpha_{o_K})\big], \qquad (5)$$

where MV-Kumaraswamy stands for Multivariate Kumaraswamy. Here, $O$ is the set of all possible orderings (permutations) of $\{1,\ldots,K\}$. In the context of [8], we create a U-statistic over the variables $x, \alpha$.
The expectation in eq. (5) is a summation since we are uniformly sampling $o$ from a discrete set. We therefore can apply lemma 1 to eq. (5) to prove corollary 1.

Corollary 1. Let $S \subseteq \{1,\ldots,K\}$ be the set of indices $i$ where for $i \neq j$ we have $\alpha_i = \alpha_j$. Define $A = \{1,\ldots,K\} \setminus S$. Then, $\mathbb{E}_{o \sim \mathrm{Uniform}(O)}\big[f_o(x_{o_1},\ldots,x_{o_K}; \alpha_{o_1},\ldots,\alpha_{o_K})\big]$ is symmetric across barycentric axes $x_a\ \forall a \in A$.

While the factorial growth ($|O| = K!$) for full symmetry is undesirable, we expect approximate symmetry should arise, in expectation, after $O(K)$ samples. Since the problematic bias occurs during the last stick break, each label ideally experiences an ordering where it is not last; this occurs with probability $\frac{K-1}{K}$. Thus, a label is not last, in expectation, after $\frac{K}{K-1}$ draws from $\mathrm{Uniform}(O)$. Therefore, to satisfy this condition for all labels, one needs $\frac{K^2}{K-1} = O(K)$ samples, in expectation. An alternative, which we discuss and demonstrate below in fig. 4, would be to use the $K$ cyclic orderings (e.g. $(1,2,3), (2,3,1), (3,1,2)$ for $K = 3$) to achieve approximate symmetry (exchangeability).
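The sketch below (ours; `mvk_sample` and `kumaraswamy` are illustrative names, and the helpers are repeated so the block runs standalone) draws samples by choosing an ordering per draw, either uniformly at random, which samples eq. (5) exactly, or from the $K$ cyclic orderings, which samples the $K$-term approximation just described, and checks that the fixed-order bias in the last coordinate (cf. fig. 1) disappears.

```python
# Sketch (ours): MV-Kumaraswamy sampling. Draw an ordering, stick-break with
# Kumaraswamy sticks, un-permute. Random or cyclic orderings remove the
# last-coordinate bias that a fixed ordering produces.
import numpy as np

def kumaraswamy(rng, a, b):
    return (1.0 - (1.0 - rng.uniform()) ** (1.0 / b)) ** (1.0 / a)

def mvk_sample(rng, alpha, ordering):
    alpha = np.asarray(alpha, dtype=float)
    o = np.asarray(ordering)
    a = alpha[o]
    x_perm = np.zeros(len(a))
    remaining = 1.0
    for i in range(len(a) - 1):
        v = kumaraswamy(rng, a[i], a[i + 1:].sum())
        x_perm[i] = v * remaining
        remaining -= x_perm[i]
    x_perm[-1] = remaining
    x = np.zeros_like(x_perm)
    x[o] = x_perm
    return x

rng = np.random.default_rng(0)
alpha = np.full(5, 1.0 / 5.0)          # sparsity-inducing, as in fig. 1
K, n = len(alpha), 20_000
fixed = np.array([mvk_sample(rng, alpha, np.arange(K)) for _ in range(n)])
randomized = np.array([mvk_sample(rng, alpha, rng.permutation(K)) for _ in range(n)])
cyclic = np.array([mvk_sample(rng, alpha, np.roll(np.arange(K), -t)) for t in range(n)])

print("Dirichlet target mean:", alpha / alpha.sum())
print("fixed ordering       :", fixed.mean(axis=0))    # last coordinate is biased
print("random orderings     :", randomized.mean(axis=0))
print("cyclic orderings     :", cyclic.mean(axis=0))
```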
In fig. 2, we provide 1-simplex examples for varying $\alpha$ that demonstrate the effect ordering has on the Kumaraswamy distributions $f_{1,2}(x; \alpha)$ and $f_{2,1}(x; \alpha)$ (respectively in eqs. (3) and (4)). In each example, we plot the symmetrized versions arising from our proposed distribution $\mathbb{E}_o[f_o(x; \alpha)]$ (eq. (5)). For reference, we plot the corresponding $\mathrm{Dirichlet}(x; \alpha)$, which is equivalent to $\mathrm{Beta}(x_1; \alpha_1, \alpha_2)$ for the 1-simplex. Qualitatively, we observe how effectively our proposed distribution resolves the differences between $f_{1,2}$ and $f_{2,1}$ and yields $\mathbb{E}_o[f_o(x; \alpha)] \approx \mathrm{Dirichlet}(x; \alpha)$.

Figure 2: Kumaraswamy asymmetry and symmetrization examples on the 1-simplex.

In fig. 3, we employ Beta distributed stick breaks to generate a Dirichlet random variable. In this example, we pick an $\alpha$ such that the resulting density should be symmetric only about the barycentric $x_1$ axis. Furthermore, because the resulting density is a Dirichlet, the densities arising from all possible orderings should be identical, with identical barycentric symmetry properties. The first row contains densities. The subsequent rows measure asymmetry across the specified barycentric axis by computing the absolute difference of the PDF folded along that axis. The first column is for the expectation over all possible orderings. The second column is for the expectation over the cyclic orderings. Each column thereafter represents a different stick-breaking order. Indeed, we find that the Dirichlet has an order-agnostic density with symmetry only about the barycentric $x_1$ axis.

Figure 3: 2-simplex with Beta sticks. Figure 4: 2-simplex with Kumaraswamy sticks.

In fig. 4, we employ the same methodology with the same $\alpha$ as in fig. 3, but this time we use Kumaraswamy distributed stick breaks. Note the significant variations among the densities resulting from the different orderings. It follows that symmetry/asymmetry too vary with respect to ordering. We only see the desired symmetry about the barycentric $x_1$ axis when we take the expectation over all orderings. This example qualitatively illustrates corollary 1. However, we do achieve approximate symmetry when we average over the $K$ cyclic orderings, suggesting we can, in practice, get away with linearly scaling complexity.
3 Gradient Variance

We compare our method's gradient variance to other non-explicit gradient reparameterization methods: Implicit Reparameterization Gradients (IRG) [4], RSVI [18], and Generalized Reparameterization Gradient (GRG) [22]. These works all seek gradient methods with low variance. In fig. 5, we compare MV-Kumaraswamy's (MVK) gradient variance to these other methods by leveraging techniques and code from [18]. Specifically, we consider their test that fits a variational Dirichlet posterior to Categorical data with a Dirichlet prior. In this conjugate setting, true analytic gradients can be computed. Their reported gradient variance is actually the mean square error with respect to the true gradient. In our test, however, we are fitting an MV-Kumaraswamy variational posterior. Therefore, we compute gradient variance, for all methods, according to variance's more common definition. Our tests show that IRG and RSVI ($B = 10$) offer similar variance; this result matches findings in [4].

Figure 5: Variance of the ELBO's gradient's first dimension for GRG [22], RSVI [18], IRG [4], and MVK (ours) when fitting a variational posterior to Categorical data with 100 dimensions and a Dirichlet prior. They fit a Dirichlet. We fit an MV-Kumaraswamy using $K = 100$ samples from $\mathrm{Uniform}(O)$ to Monte-Carlo approximate the full expectation; this corresponds to linear complexity.
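As a small, self-contained illustration of why an explicit reparameterization yields low-variance gradients (ours; this is not the experiment behind fig. 5), the sketch below differentiates a Monte-Carlo estimate of $\mathbb{E}[x]$ for $x \sim \mathrm{Kumaraswamy}(a, b)$ through the inverse CDF and compares it to a finite difference of the closed-form moment $b\,\mathrm{B}(1 + 1/a, b)$.

```python
# Sketch (ours): pathwise gradient d/da E[x] for x ~ Kumaraswamy(a, b), obtained by
# differentiating x = (1 - (1 - u)^(1/b))^(1/a) with the uniform noise u held fixed.
import numpy as np
from scipy.special import beta as beta_fn

def kuma_mean(a, b):
    return b * beta_fn(1.0 + 1.0 / a, b)       # closed-form E[x]

a, b, n = 2.0, 3.0, 400_000
rng = np.random.default_rng(0)
u = rng.uniform(size=n)
x = (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

# x = w^(1/a) with w = 1 - (1 - u)^(1/b), hence dx/da = -x * ln(x) / a.
grad_samples = -x * np.log(x) / a
fd = (kuma_mean(a + 1e-4, b) - kuma_mean(a - 1e-4, b)) / 2e-4

print("pathwise estimate of dE[x]/da :", grad_samples.mean(),
      "+/-", grad_samples.std() / np.sqrt(n))
print("finite-difference reference   :", fd)
```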
4 A single generative model for semi-supervised learning

We demonstrate the utility of the MV-Kumaraswamy in the context of a parsimonious generative model for semi-supervised learning, with observed data $x$, partially observable classes/labels $y$ with prior $\pi$, and latent variable $z$, all of which are local to each data point. We specify

$$\pi \sim \mathrm{Dirichlet}(\pi; \alpha), \quad z \sim \mathcal{N}(z; 0, I), \quad y \mid \pi \sim \mathrm{Discrete}(y; \pi), \quad x \mid y, z \sim p(x \mid f(y, z; \theta)),$$

where $f(y, z; \theta)$ is a neural network, with parameters $\theta$, operating on the latent variables.
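For concreteness, here is a minimal ancestral-sampling sketch of this generative process (ours; `decode` is a hypothetical stand-in for $f(y, z; \theta)$, and a Bernoulli likelihood over binarized pixels is assumed purely for illustration):

```python
# Sketch (ours): pi ~ Dirichlet(alpha), z ~ N(0, I), y | pi ~ Discrete(pi),
# x | y, z ~ p(x | f(y, z; theta)), here with a Bernoulli likelihood.
import numpy as np

rng = np.random.default_rng(0)
alpha = np.full(10, 0.1)                     # sparse Dirichlet prior over 10 classes
dim_z, dim_x = 8, 784
W = rng.standard_normal((dim_x, len(alpha) + dim_z)) * 0.01   # stand-in decoder weights

def decode(y, z):
    """Placeholder for f(y, z; theta): maps (one-hot y, z) to Bernoulli means."""
    h = np.concatenate([np.eye(len(alpha))[y], z])
    return 1.0 / (1.0 + np.exp(-W @ h))

pi = rng.dirichlet(alpha)                    # class proportions
z = rng.standard_normal(dim_z)               # continuous latent
y = rng.choice(len(alpha), p=pi)             # class label
x = rng.binomial(1, decode(y, z))            # observation, e.g. binarized pixels
print(y, x.shape)
```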
For observable $y$, the evidence lower bound (ELBO) for a mean-field posterior approximation $q(\pi, z) = q(\pi)q(z)$ is

$$\ln p(x, y) \geq \mathbb{E}_{q(\pi, z)}\big[\ln p(x \mid f(y, z; \theta)) + \ln \pi_y\big] - D_{KL}(q(\pi)\,\|\,p(\pi)) - D_{KL}(q(z)\,\|\,p(z)) \equiv \mathcal{L}_l(x, y, \theta). \qquad (6)$$

For latent $y$, we can derive an alternative ELBO that corresponds to the same generative process as eq. (6), by reintroducing $y$ via marginalization. We derive eqs. (6) and (7) in section 7.3 of the appendix.

$$\ln p(x) \geq \mathbb{E}_{q(\pi, z)}\Big[\ln \sum_{y} p(x \mid f(y, z; \theta))\, \pi_y\Big] - D_{KL}(q(\pi)\,\|\,p(\pi)) - D_{KL}(q(z)\,\|\,p(z)) \equiv \mathcal{L}_u(x, \theta) \qquad (7)$$

Let $L$ be our set of labeled data and $U$ be our unlabeled set. We then consider a combined objective

$$\mathcal{L} = \frac{1}{|L|}\sum_{(x,y)\in L} \mathcal{L}_l(x, y, \theta) + \frac{1}{|U|}\sum_{x\in U} \mathcal{L}_u(x, \theta) \qquad (8)$$

$$\approx \frac{1}{B}\sum_{\substack{(x_i, y_i)\sim L \\ i \leq B}} \mathcal{L}_l(x_i, y_i, \theta) + \frac{1}{B}\sum_{\substack{x_i\sim U \\ i \leq B}} \mathcal{L}_u(x_i, \theta) \qquad (9)$$

that balances the two ELBOs evenly. Of concern is when $|U| \gg |L|$. Here, the optimizer could effectively ignore $\mathcal{L}_l(x, y, \theta)$. This possibility motivates our rebalancing in eq. (8). During optimization we employ batch updates of size $B$ to maximize eq. (9), which similarly balances the contribution between $U$ and $L$. We define an epoch to be the set of batches (sampled without replacement) that constitute $U$. Therefore, when $|U| \gg |L|$, the optimizer will observe samples from $L$ many more times than samples from $U$. Intuitively, the data with observable labels, in conjunction with eq. (6), breaks symmetry and encourages the correct assignment of classes to labels.
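A minimal sketch of this balanced batching (ours; `elbo_l` and `elbo_u` are hypothetical stand-ins for $\mathcal{L}_l$ and $\mathcal{L}_u$) is shown below.

```python
# Sketch (ours): each update averages B labeled and B unlabeled terms so the two
# ELBOs contribute equally, mirroring the 1/|L| and 1/|U| weights in eq. (8).
import numpy as np

def combined_objective(labeled_batch, unlabeled_batch, elbo_l, elbo_u):
    """labeled_batch = (x, y) arrays of length B; unlabeled_batch = x array of length B."""
    xl, yl = labeled_batch
    term_l = np.mean([elbo_l(x, y) for x, y in zip(xl, yl)])
    term_u = np.mean([elbo_u(x) for x in unlabeled_batch])
    return term_l + term_u   # maximized with respect to model and variational parameters
```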
Following [12, 13], we use an inference network with parameters $\phi$ and define our variational distribution $q(z) = \mathcal{N}(z; \mu(x), \Sigma(x))$, where $\mu(x)$ and $\Sigma(x)$ are outputs of a neural network operating on the observable data. We restrict $\Sigma(x)$ to output a diagonal covariance and use a softplus, $\ln(\exp(x) + 1)$, output layer to constrain it to the positive reals. Since $\mu(x) \in \mathbb{R}^{\dim(z)}$, we use an affine output layer. We let $q(\pi) = \mathrm{MV\text{-}Kumaraswamy}(\pi; \gamma(x))$, where $\gamma(x)$ is also an output of our inference network. We similarly restrict $\gamma(x)$ to the positive reals via the softplus activation. We evaluate the expectations in eqs. (6) and (7) using Monte-Carlo integration. For $q(z)$, we sample from $\mathcal{N}(0, I)$ and utilize the reparameterization trick. Since $q(\pi)$ contains an expectation over
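A sketch of these inference-network output heads (ours, in plain NumPy for brevity; `encode` and the weight names are illustrative, and `hidden` stands for shared encoder features) follows.

```python
# Sketch (ours): softplus keeps the diagonal scale of q(z) and the MV-Kumaraswamy
# parameters gamma(x) positive; mu(x) uses an affine head; z is reparameterized.
import numpy as np

def softplus(t):
    return np.log1p(np.exp(-np.abs(t))) + np.maximum(t, 0.0)   # stable ln(1 + e^t)

def encode(hidden, W_mu, W_sig, W_gam, rng):
    mu = W_mu @ hidden                   # affine head: mu(x) in R^dim(z)
    sigma = softplus(W_sig @ hidden)     # positive diagonal scale for q(z)
    gamma = softplus(W_gam @ hidden)     # positive MV-Kumaraswamy parameters
    z = mu + sigma * rng.standard_normal(mu.shape)   # reparameterization trick
    return z, mu, sigma, gamma

# Example shapes: dim(hidden) = 32, dim(z) = 8, K = 10 classes.
rng = np.random.default_rng(0)
h = rng.standard_normal(32)
W_mu, W_sig, W_gam = (rng.standard_normal((d, 32)) * 0.1 for d in (8, 8, 10))
z, mu, sigma, gamma = encode(h, W_mu, W_sig, W_gam, rng)
print(z.shape, sigma.min() > 0, gamma.min() > 0)
```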