Asymptotic Guarantees for Learning Generative Models with the Sliced-Wasserstein Distance

Kimia Nadjahi¹, Alain Durmus², Umut Şimşekli¹,³, Roland Badeau¹
1: LTCI, Télécom Paris, Institut Polytechnique de Paris, France
2: CMLA, ENS Cachan, CNRS, Université Paris-Saclay, France
3: Department of Statistics, University of Oxford, UK
{kimia.nadjahi, umut.simsekli, roland.badeau}@telecom-paris.fr, alain.durmus@cmla.ens-cachan.fr

Abstract

Minimum expected distance estimation (MEDE) algorithms have been widely used for probabilistic models with intractable likelihood functions, and they have become increasingly popular due to their use in implicit generative modeling (e.g. Wasserstein generative adversarial networks, Wasserstein autoencoders). Emerging from computational optimal transport, the Sliced-Wasserstein (SW) distance has become a popular choice in MEDE thanks to its simplicity and computational benefits. While several studies have reported empirical success on generative modeling with SW, the theoretical properties of such estimators have not yet been established. In this study, we investigate the asymptotic properties of estimators that are obtained by minimizing SW. We first show that convergence in SW implies weak convergence of probability measures in general Wasserstein spaces. Then we show that estimators obtained by minimizing SW (and also an approximate version of SW) are asymptotically consistent. We finally prove a central limit theorem, which characterizes the asymptotic distribution of the estimators and establishes a convergence rate of $\sqrt{n}$, where $n$ denotes the number of observed data points. We illustrate the validity of our theory on both synthetic data and neural networks.

1 Introduction

Minimum distance estimation (MDE) is a generalization of maximum-likelihood inference, where the goal is to minimize a distance between the empirical distribution of a set of independent and identically distributed (i.i.d.) observations $Y_{1:n} = (Y_1, \dots, Y_n)$ and a family of distributions indexed by a parameter $\theta$. The problem is formally defined as follows [1, 2]:

$$\hat{\theta}_n = \operatorname*{argmin}_{\theta \in \Theta} \mathbf{D}(\hat{\mu}_n, \mu_\theta), \tag{1}$$

where $\mathbf{D}$ denotes a distance (or a divergence in general) between probability measures, $\mu_\theta$ denotes a probability measure indexed by $\theta$, $\Theta$ denotes the parameter space, and

$$\hat{\mu}_n = \frac{1}{n} \sum_{i=1}^{n} \delta_{Y_i} \tag{2}$$

denotes the empirical measure of $Y_{1:n}$, with $\delta_Y$ being the Dirac distribution with mass on the point $Y$. When $\mathbf{D}$ is chosen as the Kullback-Leibler divergence, this formulation coincides with maximum likelihood estimation (MLE) [2].
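To make the connection with MLE explicit (a standard derivation added for completeness, assuming each $\mu_\theta$ admits a density $p_\theta$): minimizing the KL divergence over $\theta$ discards the entropy term of $\hat{\mu}_n$, which does not depend on $\theta$, so that

$$\operatorname*{argmin}_{\theta \in \Theta} \mathrm{KL}(\hat{\mu}_n \,\|\, \mu_\theta) = \operatorname*{argmin}_{\theta \in \Theta} \left( -\frac{1}{n} \sum_{i=1}^{n} \log p_\theta(Y_i) \right) = \operatorname*{argmax}_{\theta \in \Theta} \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(Y_i).$$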

While MDE provides a fruitful framework for statistical inference, when working with generative models, solving the optimization problem in (1) might be intractable since it might be impossible to evaluate the probability density function associated with $\mu_\theta$. Nevertheless, in various settings, even if the density is not available, one can still generate samples from the distribution $\mu_\theta$, and such samples turn out to be useful for making inference. More precisely, under such settings, a natural alternative to (1) is the minimum expected distance estimator, which is defined as follows [3]:

$$\hat{\theta}_{n,m} = \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[\mathbf{D}(\hat{\mu}_n, \hat{\mu}_{\theta,m}) \,\middle|\, Y_{1:n}\right]. \tag{3}$$

Here,

$$\hat{\mu}_{\theta,m} = \frac{1}{m} \sum_{i=1}^{m} \delta_{Z_i} \tag{4}$$

denotes the empirical distribution of $Z_{1:m}$, a sequence of i.i.d. random variables with distribution $\mu_\theta$. This algorithmic framework has computationally favorable properties since one can replace the expectation with a simple Monte Carlo average in practical applications.
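As a concrete illustration of this Monte Carlo strategy, the conditional expectation in (3) can be approximated by averaging the distance over several independent synthetic datasets. The sketch below is ours, not code from the paper; the sampler `sample_model`, the distance `dist`, and the number of Monte Carlo draws `K` are all placeholder assumptions:

```python
import numpy as np

def mede_objective(theta, Y, sample_model, dist, m=500, K=10, rng=None):
    """Monte Carlo estimate of the MEDE objective E[D(mu_hat_n, mu_hat_{theta,m}) | Y_{1:n}].

    theta        -- candidate parameter
    Y            -- observed data Y_{1:n}, shape (n, d)
    sample_model -- callable (theta, m, rng) -> i.i.d. sample Z_{1:m} from mu_theta, shape (m, d)
    dist         -- callable (X, Z) -> distance between the empirical measures of X and Z
    K            -- number of independent synthetic datasets in the Monte Carlo average
    """
    rng = np.random.default_rng(rng)
    return float(np.mean([dist(Y, sample_model(theta, m, rng)) for _ in range(K)]))
```

Minimizing this objective over $\theta$ (by grid search, random search, or gradient-based methods when `dist` is differentiable) then yields the estimator (3).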

In the context of MDE, distances that are based on optimal transport (OT) have become increasingly popular due to their computational and theoretical properties [4, 5, 6, 7, 8]. For instance, if we replace the distance $\mathbf{D}$ in (3) with the Wasserstein distance (defined in Section 2 below), we obtain the minimum expected Wasserstein estimator [3]. In the classical statistical inference setting, the typical use of such an estimator is to infer the parameters of a measure whose density does not admit an analytical closed-form formula [2]. On the other hand, in the implicit generative modeling (IGM) setting, this estimator forms the basis of two popular IGM strategies: Wasserstein generative adversarial networks (GAN) [4] and Wasserstein variational auto-encoders (VAE) [5] (cf. [9] for their relation). The goal of these two methods is to find the best parametric transport map $T_\theta$, such that $T_\theta$ transforms a simple distribution $\mu$ (e.g. standard Gaussian or uniform) to a potentially complicated data distribution $\hat{\mu}_n$ by minimizing the Wasserstein distance between the transported distribution $\mu_\theta = T_{\theta\sharp}\mu$ and $\hat{\mu}_n$, where $\sharp$ denotes the push-forward operator, to be defined in the next section. In practice, $T_\theta$ is typically chosen as a neural network, for which it is often impossible to evaluate the induced density. However, one can easily generate samples from $\mu_\theta$ by first generating a sample from $\mu$ and then applying $T_\theta$ to that sample, making minimum expected distance estimation (3) feasible for this setting.
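In code, sampling from the push-forward measure $\mu_\theta = T_{\theta\sharp}\mu$ is simply "sample from $\mu$, then apply $T_\theta$". Below is a minimal sketch (our illustration, not the paper's implementation); the affine map stands in for an arbitrary neural network generator:

```python
import numpy as np

def sample_pushforward(T, m, d_latent, rng=None):
    """Draw m samples from T_sharp(mu), where mu is a standard Gaussian on R^d_latent."""
    rng = np.random.default_rng(rng)
    Z = rng.standard_normal((m, d_latent))  # Z_i ~ mu
    return T(Z)                             # T(Z_i) ~ T_sharp(mu)

# Toy generator: an affine map playing the role of a neural network T_theta.
A = np.array([[2.0, 0.0], [0.0, 0.5]])
b = np.array([1.0, -1.0])
samples = sample_pushforward(lambda Z: Z @ A.T + b, m=1000, d_latent=2)
```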

Motivated by its practical success, the theoretical properties of this estimator have recently been taken under investigation [10, 11], and very recently Bernton et al. [3] have established the consistency (for the general setting) and the asymptotic distribution (for the one-dimensional setting) of this estimator.

Even though estimation with the Wasserstein distance has served as fertile ground for many generative modeling applications, except for the case where the measures are supported on $\mathbb{R}$, the computational complexity of minimum Wasserstein estimators rapidly becomes excessive as the problem dimension increases, and developing accurate and efficient approximations is a highly non-trivial task. Therefore, there have been several attempts to use more practical alternatives to the Wasserstein distance [12, 6]. In this context, the Sliced-Wasserstein (SW) distance [13, 14, 15] has been an increasingly popular alternative to the Wasserstein distance: it is defined as an average of one-dimensional Wasserstein distances, which allows it to be computed in an efficient manner.
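To make the "average of one-dimensional distances" concrete, here is a minimal Monte Carlo sketch of $SW_p$ between two empirical measures (our illustration, not the paper's implementation; the number of projections `n_proj` and the quantile grid size `n_quantiles` are tuning assumptions). Random directions are drawn from the unit sphere, both samples are projected onto each direction, and each one-dimensional $W_p$ reduces to a comparison of quantile functions:

```python
import numpy as np

def sliced_wasserstein(X, Z, p=2, n_proj=100, n_quantiles=200, rng=None):
    """Monte Carlo estimate of SW_p between the empirical measures of X and Z.

    X, Z -- samples of shape (n, d) and (m, d); n and m may differ.
    """
    rng = np.random.default_rng(rng)
    d = X.shape[1]
    # Random directions, uniform on the unit sphere S^{d-1}.
    U = rng.standard_normal((n_proj, d))
    U /= np.linalg.norm(U, axis=1, keepdims=True)
    # One-dimensional W_p via quantile functions:
    # W_p^p(mu, nu) = int_0^1 |F_mu^{-1}(q) - F_nu^{-1}(q)|^p dq.
    qs = (np.arange(n_quantiles) + 0.5) / n_quantiles
    qX = np.quantile(X @ U.T, qs, axis=0)   # shape (n_quantiles, n_proj)
    qZ = np.quantile(Z @ U.T, qs, axis=0)
    # Mean over quantiles approximates the 1-D integral; mean over
    # projections approximates the average over the sphere.
    return float(np.mean(np.abs(qX - qZ) ** p) ** (1.0 / p))
```

The cost is dominated by sorting the projections, roughly $O(\text{n\_proj} \cdot n \log n)$, which is what makes SW attractive compared to solving a full $d$-dimensional OT problem.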

While several studies have reported empirical success on generative modeling with SW [16, 17, 18, 19], the theoretical properties of such estimators have not yet been fully established. Bonnotte [14] proved that SW is a proper metric and that, on compact domains, SW is equivalent to the Wasserstein distance; hence convergence in SW implies weak convergence on compact domains. [14] also analyzed the gradient flows based on SW, which then served as a basis for a recently proposed IGM algorithm [18]. Finally, recent studies [16, 20] investigated the sample complexity of SW and established bounds on the SW distance between two measures and their empirical instantiations.

In this paper, we investigate the asymptotic properties of the estimators given in (1) and (3) when $\mathbf{D}$ is replaced with the SW distance. We first prove that convergence in SW implies weak convergence of probability measures defined on general domains, which generalizes the results given in [14]. Then, by using techniques similar to the ones given in [3], we show that the estimators defined by (1) and (3) are consistent, meaning that as the number of observations $n$ increases, the estimates get closer to the data-generating parameters. We finally prove a central limit theorem (CLT) in the multidimensional setting, which characterizes the asymptotic distribution of these estimators and establishes a convergence rate of $\sqrt{n}$. The CLT that we prove is stronger than the one given in [3] in the sense that it is not restricted to the one-dimensional setting.

We support our theory with experiments conducted on both synthetic and real data. We first consider a classical statistical inference setting with a Gaussian model and a multidimensional $\alpha$-stable model whose density is not available in closed form. In both models, the experiments validate our consistency and CLT results. We further observe that, especially for high-dimensional problems, the estimators obtained by minimizing SW have significantly better computational properties than the ones obtained by minimizing the Wasserstein distance, as expected. In the IGM setting, we consider the neural network-based generative modeling algorithm proposed in [16] and show that our results hold in the real data setting as well.

2 Preliminaries and Technical Background

We consider a probability space $(\Omega, \mathcal{F}, \mathbb{P})$ with associated expectation operator $\mathbb{E}$, on which all the random variables are defined. Let $(Y_k)_{k \in \mathbb{N}}$ be a sequence of random variables associated with observations, where each observation takes value in $\mathsf{Y} \subset \mathbb{R}^d$. We assume that these observations are i.i.d. according to $\mu_\star \in \mathcal{P}(\mathsf{Y})$, where $\mathcal{P}(\mathsf{Y})$ stands for the set of probability measures on $\mathsf{Y}$. A statistical model is a family of distributions on $\mathsf{Y}$ and is denoted by $\mathcal{M} = \{\mu_\theta \in \mathcal{P}(\mathsf{Y}), \theta \in \Theta\}$, where $\Theta \subset \mathbb{R}^{d_\theta}$ is the parameter space. In this paper, we focus on parameter inference for purely generative models: for all $\theta \in \Theta$, we can generate i.i.d. samples $(Z_k)_{k \in \mathbb{N}} \in \mathsf{Y}^{\mathbb{N}}$ from $\mu_\theta$, but the associated likelihood is numerically intractable. In the sequel, $(Z_k)_{k \in \mathbb{N}}$ denotes an i.i.d. sequence from $\mu_\theta$ with $\theta \in \Theta$, and for any $m \in \mathbb{N}^*$, $\hat{\mu}_{\theta,m} = (1/m) \sum_{i=1}^{m} \delta_{Z_i}$ denotes the corresponding empirical distribution.

Throughout our study, we assume that the following conditions hold: (1) $\mathsf{Y}$, endowed with the Euclidean distance $\rho$, is a Polish space; (2) $\Theta$, endowed with the distance $\rho_\Theta$, is a Polish space; (3) $\Theta$ is a $\sigma$-compact space, i.e. the union of countably many compact subspaces; and (4) the parameters are identifiable, i.e. $\mu_\theta = \mu_{\theta'}$ implies $\theta = \theta'$. We endow $\mathcal{P}(\mathsf{Y})$ with the Lévy-Prokhorov distance $d_{\mathcal{P}}$, which metrizes weak convergence by [21, Theorem 6.8] since $\mathsf{Y}$ is assumed to be a Polish space. We denote by $\mathcal{Y}$ the Borel $\sigma$-field of $(\mathsf{Y}, \rho)$.

Wasserstein distance. For $p \geq 1$, we denote by $\mathcal{P}_p(\mathsf{Y})$ the set of probability measures on $\mathsf{Y}$ with finite $p$-th moment: $\mathcal{P}_p(\mathsf{Y}) = \{\mu \in \mathcal{P}(\mathsf{Y}) : \int_{\mathsf{Y}} \|y - y_0\|^p \, \mathrm{d}\mu(y) < +\infty \text{ for some } y_0 \in \mathsf{Y}\}$.

[…]

A3. There exists $\epsilon > 0$ such that, setting $\epsilon_\star = \inf_{\theta \in \Theta} SW_p(\mu_\star, \mu_\theta)$, the set $\Theta_\epsilon^\star = \{\theta \in \Theta : SW_p(\mu_\star, \mu_\theta) \leq \epsilon_\star + \epsilon\}$ is bounded.

These assumptions are mostly related to the identifiability of the statistical model and the regularity of the data-generating process. They are arguably mild assumptions, analogous to those that have already been considered in the literature [3]. Note that, without Theorem 1, the formulation and use of A2 in our proofs in the supplementary document would not be possible.

In the next result, we establish the consistency of MSWE (the minimum Sliced-Wasserstein estimator, i.e. (1) with $SW_p$ in place of $\mathbf{D}$).

Theorem 2 (Existence and consistency of MSWE). Assume A1, A2 and A3. There exists $E \in \mathcal{F}$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$,

$$\lim_{n \to +\infty} \inf_{\theta \in \Theta} SW_p(\hat{\mu}_n(\omega), \mu_\theta) = \inf_{\theta \in \Theta} SW_p(\mu_\star, \mu_\theta), \quad \text{and} \tag{8}$$

$$\limsup_{n \to +\infty} \operatorname*{argmin}_{\theta \in \Theta} SW_p(\hat{\mu}_n(\omega), \mu_\theta) \subset \operatorname*{argmin}_{\theta \in \Theta} SW_p(\mu_\star, \mu_\theta), \tag{9}$$

where $\hat{\mu}_n$ is defined by (2). Besides, for all $\omega \in E$, there exists $n(\omega)$ such that, for all $n \geq n(\omega)$, the set $\operatorname*{argmin}_{\theta \in \Theta} SW_p(\hat{\mu}_n(\omega), \mu_\theta)$ is non-empty.

Our proof technique is similar to the one given in [3]. This result shows that, as the number of observations goes to infinity, the estimate $\hat{\theta}_n$ converges to a global minimizer of the problem $\min_{\theta \in \Theta} SW_p(\mu_\star, \mu_\theta)$. In our next result, we prove a similar property for MESWEs (minimum expected Sliced-Wasserstein estimators, i.e. (3) with $SW_p$ in place of $\mathbf{D}$) as $\min(m, n)$ goes to infinity. In order to increase clarity, and without loss of generality, we consider $m$ as a function of $n$ such that $\lim_{n \to +\infty} m(n) = +\infty$.

Now, we derive an analogous version of Theorem 2 for MESWE. For this result, we need to introduce another continuity assumption.

A4. If $\lim_{n \to +\infty} \rho_\Theta(\theta_n, \theta) = 0$, then $\lim_{n \to +\infty} \mathbb{E}\left[SW_p(\mu_{\theta_n}, \hat{\mu}_{\theta_n, n}) \,\middle|\, Y_{1:n}\right] = 0$.

The next theorem establishes the consistency of MESWE.

Theorem 3 (Existence and consistency of MESWE). Assume A1, A2, A3 and A4. Let $(m(n))_{n \in \mathbb{N}}$ be an increasing sequence satisfying $\lim_{n \to +\infty} m(n) = +\infty$. There exists a set $E$ with $\mathbb{P}(E) = 1$ such that, for all $\omega \in E$,

$$\lim_{n \to +\infty} \inf_{\theta \in \Theta} \mathbb{E}\left[SW_p(\hat{\mu}_n, \hat{\mu}_{\theta, m(n)}) \,\middle|\, Y_{1:n}\right] = \inf_{\theta \in \Theta} SW_p(\mu_\star, \mu_\theta), \quad \text{and} \tag{10}$$

$$\limsup_{n \to +\infty} \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[SW_p(\hat{\mu}_n, \hat{\mu}_{\theta, m(n)}) \,\middle|\, Y_{1:n}\right] \subset \operatorname*{argmin}_{\theta \in \Theta} SW_p(\mu_\star, \mu_\theta), \tag{11}$$

where $\hat{\mu}_n$ and $\hat{\mu}_{\theta, m(n)}$ are defined by (2) and (4) respectively. Besides, for all $\omega \in E$, there exists $n(\omega)$ such that, for all $n \geq n(\omega)$, the set $\operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}[SW_p(\hat{\mu}_n, \hat{\mu}_{\theta, m(n)}) \,|\, Y_{1:n}]$ is non-empty.

Similar to Theorem 2, this theorem shows that, as the number of observations goes to infinity, the estimator obtained with the expected distance converges to a global minimizer.

3.3 Convergence of MESWE to MSWE

In practical applications, we can only use a finite number of generated samples $Z_{1:m}$. In this subsection, we analyze the case where the observations $Y_{1:n}$ are kept fixed while the number of generated samples increases, i.e. $m \to +\infty$, and we show that in this scenario MESWE converges to MSWE, assuming the latter exists. Before deriving this result, we formulate a technical assumption.

A5. For some $\epsilon > 0$ and $\epsilon_n = \inf_{\theta \in \Theta} SW_p(\hat{\mu}_n, \mu_\theta)$, the set $\Theta_{\epsilon, n} = \{\theta \in \Theta : SW_p(\hat{\mu}_n, \mu_\theta) \leq \epsilon_n + \epsilon\}$ is bounded almost surely.

Theorem 4 (MESWE converges to MSWE as $m \to +\infty$). Assume A1, A4 and A5. Then,

$$\lim_{m \to +\infty} \inf_{\theta \in \Theta} \mathbb{E}\left[SW_p(\hat{\mu}_n, \hat{\mu}_{\theta, m}) \,\middle|\, Y_{1:n}\right] = \inf_{\theta \in \Theta} SW_p(\hat{\mu}_n, \mu_\theta), \tag{12}$$

$$\limsup_{m \to +\infty} \operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}\left[SW_p(\hat{\mu}_n, \hat{\mu}_{\theta, m}) \,\middle|\, Y_{1:n}\right] \subset \operatorname*{argmin}_{\theta \in \Theta} SW_p(\hat{\mu}_n, \mu_\theta). \tag{13}$$

Besides, there exists $\bar{m}$ such that, for any $m \geq \bar{m}$, the set $\operatorname*{argmin}_{\theta \in \Theta} \mathbb{E}[SW_p(\hat{\mu}_n, \hat{\mu}_{\theta, m}) \,|\, Y_{1:n}]$ is non-empty.

This result shows that MESWE is indeed promising in practice, as one can get more accurate estimates by increasing $m$.
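To illustrate Theorem 4 numerically, one can freeze a dataset and track how the MESWE stabilizes as $m$ grows. The sketch below is our illustration, not the paper's experiment code; it reuses the hypothetical `sliced_wasserstein` helper above and fits only a location parameter by grid search:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.normal(3.0, 1.0, size=(2000, 1))   # fixed observations, n = 2000
thetas = np.linspace(2.0, 4.0, 81)          # grid over the location parameter

def meswe_location(m, K=5):
    """Grid-search MESWE of the location of N(theta, 1) from m generated samples."""
    obj = [np.mean([sliced_wasserstein(Y, rng.normal(t, 1.0, size=(m, 1)))
                    for _ in range(K)])
           for t in thetas]
    return thetas[int(np.argmin(obj))]

for m in (10, 100, 1000, 10000):
    print(m, meswe_location(m))  # the estimate settles down as m grows (Theorem 4)
```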

3.4 Rate of convergence and the asymptotic distribution

In our last set of theoretical results, we investigate the asymptotic distribution of MSWE and establish a rate of convergence. We now suppose that we are in the well-specified setting, i.e. there exists $\theta_\star$ in the interior of $\Theta$ such that $\mu_{\theta_\star} = \mu_\star$, and we consider the following two assumptions. For any $u \in \mathbb{S}^{d-1}$ and $t \in \mathbb{R}$, we define $F_\theta(u, t) = \int_{\mathsf{Y}} \mathbb{1}_{(-\infty, t]}(\langle u, y \rangle) \, \mathrm{d}\mu_\theta(y)$. Note that for any $u \in \mathbb{S}^{d-1}$, $F_\theta(u, \cdot)$ is the cumulative distribution function (CDF) associated to the measure $u^\star_\sharp \mu_\theta$, where $u^\star : y \mapsto \langle u, y \rangle$ denotes the projection onto the direction $u$.
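These one-dimensional CDFs are the natural objects here because $W_1$ on the real line has a closed form in terms of CDFs; integrating over the sphere then expresses $SW_1$ through the $F_\theta$'s (a standard identity, recalled here for readability rather than quoted from the extract):

$$W_1\!\left(u^\star_\sharp \mu_\theta, u^\star_\sharp \mu_{\theta'}\right) = \int_{\mathbb{R}} \left| F_\theta(u, t) - F_{\theta'}(u, t) \right| \mathrm{d}t, \qquad SW_1(\mu_\theta, \mu_{\theta'}) = \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| F_\theta(u, t) - F_{\theta'}(u, t) \right| \mathrm{d}t \, \mathrm{d}\sigma(u),$$

where $\sigma$ denotes the uniform distribution on $\mathbb{S}^{d-1}$; this is the form in which $SW_1$ enters A7 and Theorems 5 and 6 below.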

A6. For all $\epsilon > 0$, there exists $\delta > 0$ such that $\inf_{\theta : \rho_\Theta(\theta, \theta_\star) \geq \epsilon} SW_1(\mu_{\theta_\star}, \mu_\theta) \geq \delta$.

Let $L^1(\mathbb{S}^{d-1} \times \mathbb{R})$ denote the class of functions that are absolutely integrable on the domain $\mathbb{S}^{d-1} \times \mathbb{R}$, with respect to the measure $\sigma \otimes \mathrm{Leb}$, where $\mathrm{Leb}$ denotes the Lebesgue measure.

A7. Assume that there exists a measurable function $D_\star = (D_{\star,1}, \dots, D_{\star,d_\theta}) : \mathbb{S}^{d-1} \times \mathbb{R} \to \mathbb{R}^{d_\theta}$ such that, for each $i = 1, \dots, d_\theta$, $D_{\star,i} \in L^1(\mathbb{S}^{d-1} \times \mathbb{R})$ and

$$\int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| F_\theta(u, t) - F_{\theta_\star}(u, t) - \langle \theta - \theta_\star, D_\star(u, t) \rangle \right| \mathrm{d}t \, \mathrm{d}\sigma(u) = \rho_\Theta(\theta, \theta_\star) \, \psi(\rho_\Theta(\theta, \theta_\star)),$$

where $\psi : \mathbb{R}_+ \to \mathbb{R}_+$ satisfies $\lim_{t \to 0} \psi(t) = 0$. Besides, $(D_{\star,i})_{i=1}^{d_\theta}$ are linearly independent in $L^1(\mathbb{S}^{d-1} \times \mathbb{R})$.

For any $u \in \mathbb{S}^{d-1}$ and $t \in \mathbb{R}$, define $\hat{F}_n(u, t) = n^{-1} \operatorname{card}\{i \in \{1, \dots, n\} : \langle u, Y_i \rangle \leq t\}$, where $\operatorname{card}$ denotes the cardinality of a set; for any $u \in \mathbb{S}^{d-1}$, $\hat{F}_n(u, \cdot)$ is the CDF associated to the measure $u^\star_\sharp \hat{\mu}_n$.

A8. There exists a random element $G_\star : \mathbb{S}^{d-1} \times \mathbb{R} \to \mathbb{R}$ such that the stochastic process $\sqrt{n}\,(\hat{F}_n - F_{\theta_\star})$ converges weakly in $L^1(\mathbb{S}^{d-1} \times \mathbb{R})$ to $G_\star$.¹

Theorem 5. Assume A1, A2, A3, A6, A7 and A8. Then, the asymptotic distribution of the goodness-of-fit statistic is given by

$$\sqrt{n} \, \inf_{\theta \in \Theta} SW_1(\hat{\mu}_n, \mu_\theta) \xrightarrow{w} \inf_{\theta \in \mathbb{R}^{d_\theta}} \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| G_\star(u, t) - \langle \theta, D_\star(u, t) \rangle \right| \mathrm{d}t \, \mathrm{d}\sigma(u) \quad \text{as } n \to +\infty,$$

where $\hat{\mu}_n$ is defined by (2).

Theorem 6. Assume A1, A2, A3, A6, A7 and A8. Suppose also that the random map $\theta \mapsto \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} |G_\star(u, t) - \langle \theta, D_\star(u, t) \rangle| \, \mathrm{d}t \, \mathrm{d}\sigma(u)$ has a unique infimum almost surely. Then, MSWE with $p = 1$ satisfies

$$\sqrt{n} \, (\hat{\theta}_n - \theta_\star) \xrightarrow{w} \operatorname*{argmin}_{\theta \in \mathbb{R}^{d_\theta}} \int_{\mathbb{S}^{d-1}} \int_{\mathbb{R}} \left| G_\star(u, t) - \langle \theta, D_\star(u, t) \rangle \right| \mathrm{d}t \, \mathrm{d}\sigma(u) \quad \text{as } n \to +\infty,$$

where $\hat{\theta}_n$ is defined by (1) with $SW_1$ in place of $\mathbf{D}$.

These results show that the estimator and the associated goodness-of-fit statistic converge in distribution to a random variable, with a rate of convergence of $\sqrt{n}$. Note that $G_\star$ is defined as a random element (see A8); therefore, we cannot claim that the convergence in distribution derived in Theorems 5 and 6 implies convergence in probability. This CLT is also inspired by [3], where the asymptotic distribution associated with the minimum Wasserstein estimator was identified. However, since $W_p$ admits an analytical form only when $d = 1$, their result is restricted to the scalar case, and in their conclusion, [3] conjecture that the rate of minimum Wasserstein estimators would depend negatively on the dimension of the observation space. On the contrary, since $SW_p$ is defined in terms of one-dimensional $W_p$ distances, we circumvent the curse of dimensionality and our result holds for any finite dimension. While the perceived computational burden has created pessimism in the machine learning community about the use of Wasserstein-based methods in high-dimensional settings, which motivated the rise of regularized optimal transport [26], we believe that our findings provide an interesting counter-example to this conception.

¹Under mild assumptions on the tails of $u^\star_\sharp \mu_{\theta_\star}$ for any $u \in \mathbb{S}^{d-1}$, we believe that one can prove that A8 holds in general by extending [24, Proposition 3.5] and [25, Theorem 2.1a].

[Figure 2: Min. SW estimation on Gaussians in $\mathbb{R}^{10}$. Panels: (a) MSWE vs. $n$; (b) MESWE vs. $n = m$; (c) MESWE with $n = 2000$ vs. $m$. Figures 2a and 2b show the mean squared error between $(m_\star, \sigma_\star^2) = (0, 1)$ and the MSWE $(\hat{m}_n, \hat{\sigma}_n^2)$ (resp. the MESWE $(\hat{m}_{n,n}, \hat{\sigma}_{n,n}^2)$) for $n$ from 10 to 10000, illustrating Theorems 2 and 3. Figure 2c shows the error between $(\hat{m}_n, \hat{\sigma}_n^2)$ and $(\hat{m}_{n,m}, \hat{\sigma}_{n,m}^2)$ for 2000 observations and $m$ from 10 to 10000, to illustrate Theorem 4. Results are averaged over 100 runs; the shaded areas represent the standard deviation.]

4 Experiments

We conduct experiments on synthetic and real data to empirically confirm our theorems. We explain in Section 4 of the supplementary document the optimization methods used to find the estimators.
