Explicit Disentanglement of Appearance and Perspective in Generative Models

Nicki S. Detlefsen (nsde@dtu.dk)    Søren Hauberg (sohau@dtu.dk)
Section for Cognitive Systems, Technical University of Denmark

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Abstract

Disentangled representation learning finds compact, independent and easy-to-interpret factors of the data. Learning such representations has been shown to require an inductive bias, which we explicitly encode in a generative model of images. Specifically, we propose a model with two latent spaces: one that represents spatial transformations of the input data, and another that represents the transformed data. We find that the latter naturally captures the intrinsic appearance of the data. To realize the generative model, we propose a Variationally Inferred Transformational Autoencoder (VITAE) that incorporates a spatial transformer into a variational autoencoder. We show how to perform inference in the model efficiently by carefully designing the encoders and restricting the transformation class to be diffeomorphic. Empirically, our model separates visual style from digit type on MNIST, separates shape from pose in images of human bodies, and separates facial features from facial shape on CelebA.

1 Introduction

Disentangled Representation Learning (DRL) is a fundamental challenge in machine learning that is currently seeing a renaissance within deep generative models. DRL approaches assume that an AI agent can benefit from separating out (disentangling) the underlying structure of data into disjoint parts of its representation. This can furthermore help the interpretability of the decisions of the AI agent and thereby make them more accountable. Even though there have been attempts to find a single formalized notion of disentanglement [Higgins et al., 2018], no widely accepted theory exists (yet). However, the intuition is that a disentangled representation z should separate different informative factors of variation in the data [Bengio et al., 2012]. This means that changing a single latent dimension z_i should only change a single interpretable feature in the data space X.

Within the DRL literature, there are two main approaches. The first is to hard-wire disentanglement into the model, thereby creating an inductive bias. This is well known e.g. in convolutional neural networks, where the convolution operator creates an inductive bias towards translation in data. The second approach is to instead learn a representation that is faithful to the underlying data structure, hoping that this is sufficient to disentangle the representation. However, there is currently little to no agreement in the literature on how to learn such representations [Locatello et al., 2019].

We consider disentanglement of two explicit groups of factors, the appearance and the perspective. We here define the appearance as the factors of data that are left after transforming x by its perspective. Thus, the appearance is the form or archetype of an object, and the perspective represents the specific realization of that archetype. Practically speaking, the perspective could correspond to an image rotation that is deemed irrelevant, while the appearance is a representation of the rotated image, which is then invariant to the perspective. This interpretation of the world goes back to Plato's allegory of the cave, from which we also borrow our terminology. This notion of removing perspective before looking at the appearance is well-studied within supervised learning, e.g. using spatial transformer nets (STNs) [Jaderberg et al., 2015].

Figure 1: We disentangle data into appearance and perspective factors. First, data are encoded based on their perspective (in this case, images A and C are rotated in the same way), which is then removed from the original input. Hereafter, the transformed samples can be encoded in the appearance space (images A and B are both ones), which encodes the factors left in the data.

Figure 2: Our model, VITAE, disentangles appearance from perspective. Here we separate body pose (arm position) from body shape.

This paper contributes an explicit model for disentanglement of appearance and perspective in images, called the variationally inferred transformational autoencoder (VITAE). As the name suggests, we focus on variational autoencoders as generative models, but the idea is general (Fig. 1). First we encode/decode the perspective features in order to extract an appearance that is perspective-invariant. This is then encoded into a second latent space, where inputs with similar appearance are encoded similarly. This process generates an inductive bias that disentangles perspective and appearance. In practice, we develop an architecture that leverages the inference part of the model to guide the generator towards better disentanglement. We also show that this specific choice of architecture improves training stability with the right choice of parametrization of the perspective factors. Experimentally, we demonstrate our model on four datasets: the standard disentanglement benchmark dSprites, disentanglement of style and content on MNIST, of pose and shape in images of human bodies (Fig. 2), and of facial features and facial shape on CelebA.

2 Related work

Disentangled representation learning (DRL) has long been a goal in data analysis. Early work on non-negative matrix factorization [Lee and Seung, 1999] and bilinear models [Tenenbaum and Freeman, 2000] showed how images can be composed into semantic "parts" that can be glued together to form the final image. Similarly, EigenFaces [Turk and Pentland, 1991] have often been used to factor out lighting conditions from the representation [Shakunaga and Shigenari, 2001], thereby discovering some of the physics that govern the world of which the data is a glimpse. This is central to the long-standing argument that for an AI agent to understand and reason about the world, it must disentangle the explanatory factors of variation in data [Lake et al., 2016]. As such, DRL can be seen as a poor man's approximation to discovering the underlying causal factors of the data.

Independent components are, perhaps, the most stringent formalization of "disentanglement". The seminal independent component analysis (ICA) [Comon, 1994] factors the signal into statistically independent components. It has been shown that the independent components of natural images are edge filters [Bell and Sejnowski, 1997] that can be linked to the receptive fields in the human brain [Olshausen and Field, 1996]. Similar findings have been made for both video and audio [van Hateren and Ruderman, 1998, Lewicki, 2002]. DRL, thus, allows us to understand both the data and ourselves. Since independent factors are the optimal compression, ICA finds the most compact representation, implying that the predictive model can achieve maximal capacity from its parameters. This gives DRL a predictive perspective, and can be taken as a hint that a well-trained model might be disentangled. In the linear case, independent components have many successful realizations [Hyvärinen and Oja, 2000], but in the general non-linear case, the problem is not identifiable [Hyvärinen et al., 2018].

Deep DRL was initiated by Bengio et al. [2012], who sparked the current interest in the topic. One of the current state-of-the-art methods for disentangled representation learning is the β-VAE [Higgins et al., 2017], which modifies the variational autoencoder (VAE) [Kingma and Welling, 2013, Rezende et al., 2014] to learn a more disentangled representation. β-VAE places more weight on the KL-divergence in the VAE loss, thereby optimizing towards latent factors that should be axis-aligned, i.e. disentangled. Newer models like β-TCVAE [Chen et al., 2018] and DIP-VAE [Kumar et al., 2017] extend β-VAE by decomposing the KL-divergence into multiple terms, and only increase the weight on the terms that analytically disentangle the models. InfoGAN [Chen et al., 2016] extends the latent code z of the standard GAN model [Goodfellow et al., 2014] with an extra latent code c and then penalizes low mutual information between generated samples G(c, z) and c. DC-IGN [Kulkarni et al., 2015] forces the latent codes to be disentangled by only feeding in batches of data that vary in one way (e.g. pose or light) while only having small disjoint parts of the latent code active.
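To make the KL re-weighting idea behind β-VAE concrete, the following is a minimal sketch of its objective (not part of this paper's model; the function name, the Gaussian-encoder and Bernoulli-decoder choices are our assumptions):

```python
import torch

def beta_vae_loss(x, x_recon, mu, log_var, beta=4.0):
    """beta-VAE objective: reconstruction term plus beta-weighted KL.

    Assumes a Gaussian encoder q(z|x) = N(mu, diag(exp(log_var))) and a
    Bernoulli decoder; beta = 1 recovers the standard VAE ELBO.
    """
    recon = torch.nn.functional.binary_cross_entropy(x_recon, x, reduction="sum")
    # Closed-form KL(N(mu, sigma^2) || N(0, I))
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp())
    return recon + beta * kl
```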

Shape statistics is the key inspiration for our work. The shape of an object was first formalized by Kendall [1989] as what is left of an object when translation, rotation and scale are factored out. That is, the intrinsic shape of an object should not depend on the viewpoint. This idea dates, at least, back to D'Arcy Thompson [1917], who pioneered the understanding of the development of biological forms. In Kendall's formalism, the rigid transformations (translation, rotation and scale) are viewed as group actions to be factored out of the representation, such that the remainder is shape. Higgins et al. [2018] follow the same idea by defining disentanglement as a factoring of the representation into group actions. Our work can be seen as a realization of this principle within a deep generative model. When an object is represented by a set of landmarks, e.g. in the form of discrete points along its contour, then Kendall's shape space is a Riemannian manifold that exactly captures all variability among the landmarks except the translation, rotation, and scale of the object. When the object is not represented by landmarks, similar mathematical results are not available. Our work shows how the same idea can be realized for general image data, and for a much wider range of transformations than the rigid ones. Learned-Miller [2006] proposed a related linear model that generates new data by transforming a prototype, which is estimated by joint alignment.

Transformations are at the core of our method, and these leverage the architecture of spatial transformer nets (STNs) [Jaderberg et al., 2015]. While these work well within supervised learning [Lin and Lucey, 2016, Annunziata et al., 2018, Detlefsen et al., 2018], there has been limited uptake within generative models. Lin et al. [2018] combine a GAN with an STN to compose a foreground (e.g. a piece of furniture) into a background such that the result looks natural. The AIR model [Eslami et al., 2016] combines STNs with a VAE for object rendering, but does not seek disentangled representations. In supervised learning, data augmentation is often used to make a classifier partially invariant to select transformations [Baird, 1992, Hauberg et al., 2016].

3 Method

Our goal is to extend a variational autoencoder (VAE) [Kingma and Welling, 2013, Rezende et al., 2014] such that it can disentangle appearance and perspective in data.

A standard VAE assumes that data is generated by a set of latent variables following a standard Gaussian prior,

p(x) = \int p(x \mid z)\, p(z)\, dz, \qquad p(z) = \mathcal{N}(0, I_d), \qquad p(x \mid z) = \mathcal{N}\big(x \mid \mu_p(z), \sigma^2_p(z)\big) \ \text{or}\ P(x \mid z) = \mathcal{B}\big(x \mid \mu_p(z)\big).   (1)

Data x is then generated by first sampling a latent variable z and then sampling x from the conditional p(x|z) (often called the decoder). To make the model flexible enough to capture complex data distributions, μ_p and σ²_p are modeled as deep neural nets. The marginal likelihood is then intractable and a variational approximation q to p(z|x) is needed,

p(z \mid x) \approx q(z \mid x) = \mathcal{N}\big(z \mid \mu_q(x), \sigma^2_q(x)\big),   (2)

where μ_q(x) and σ²_q(x) are deep neural networks, see Fig. 3(a).

Figure 3: Architectures of the standard VAE (a) and our proposed U-VITAE (b) and C-VITAE (c) models. Here q denotes encoders, p denotes decoders, and T_θ denotes an ST-layer with transformation parameters θ. The dotted box indicates the generative model.

When training VAEs, we therefore simultaneously train a generative model p(x|z)p(z) and an inference model q(z|x) (often called the encoder). This is done by maximizing a variational lower bound to the likelihood p(x), called the evidence lower bound (ELBO):

\log p(x) \geq \mathbb{E}_{q(z \mid x)}\!\left[\log \frac{p(x, z)}{q(z \mid x)}\right] = \underbrace{\mathbb{E}_{q(z \mid x)}\big[\log p(x \mid z)\big]}_{\text{data fitting term}} - \underbrace{\mathrm{KL}\big(q(z \mid x) \,\|\, p(z)\big)}_{\text{regularization term}}.   (3)

The first term measures the reconstruction error between x and p(x|z), and the second measures the KL-divergence between the encoder q(z|x) and the prior p(z). Eq. 3 can be optimized using the reparametrization trick [Kingma and Welling, 2013]. Several improvements to VAEs have been proposed [Burda et al., 2015, Kingma et al., 2016], but our focus is on the standard model.
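To make Eqs. (1)-(3) concrete, here is a minimal PyTorch sketch of the standard VAE objective with the reparametrization trick. It is a generic illustration: the layer sizes, module names and the Bernoulli decoder are placeholder assumptions, not the networks used in the paper.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VAE(nn.Module):
    """Minimal VAE with Gaussian encoder q(z|x) and Bernoulli decoder p(x|z)."""

    def __init__(self, x_dim=784, z_dim=10, h_dim=256):
        super().__init__()
        self.enc = nn.Sequential(nn.Linear(x_dim, h_dim), nn.ReLU())
        self.enc_mu = nn.Linear(h_dim, z_dim)       # mu_q(x)
        self.enc_logvar = nn.Linear(h_dim, z_dim)   # log sigma_q^2(x)
        self.dec = nn.Sequential(nn.Linear(z_dim, h_dim), nn.ReLU(),
                                 nn.Linear(h_dim, x_dim))

    def forward(self, x):
        h = self.enc(x)
        mu, logvar = self.enc_mu(h), self.enc_logvar(h)
        # Reparametrization trick: z = mu + sigma * eps, eps ~ N(0, I)
        z = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)
        return self.dec(z), mu, logvar

def negative_elbo(x, x_logits, mu, logvar):
    """Negative ELBO of Eq. (3): reconstruction term + KL(q(z|x) || N(0, I))."""
    recon = F.binary_cross_entropy_with_logits(x_logits, x, reduction="sum")
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp())
    return recon + kl  # minimizing this maximizes the ELBO
```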

3.1 Incorporating an inductive bias

To incorporate an inductive bias that is able to disentangle appearance from perspective, we change the underlying generative model to rely on two latent factors z_A and z_P,

p(x) = \iint p(x \mid z_A, z_P)\, p(z_A)\, p(z_P)\, dz_A\, dz_P,   (4)

where we assume that z_A and z_P both follow standard Gaussian priors. Similar to a VAE, we also model the generators as deep neural networks. To generate new data x, we combine the appearance and perspective factors using the following 3-step procedure, which uses a spatial transformer (ST) layer [Jaderberg et al., 2015]:

1. Sample z_A and z_P from p(z) = \mathcal{N}(0, I_d).
2. Decode both samples: x̃ ∼ p(x̃ | z_A) and θ ∼ p(θ | z_P).
3. Transform x̃ with parameters θ using a spatial transformer layer: x = T_θ(x̃).

This process is illustrated by the dotted box in Fig. 3(b).
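A minimal PyTorch sketch of this generative procedure is given below, assuming an affine ST-layer implemented with `affine_grid`/`grid_sample`; the decoder networks `dec_appearance` and `dec_perspective` are hypothetical placeholders, not the architectures from the paper.

```python
import torch
import torch.nn.functional as F

def generate(dec_appearance, dec_perspective, num_samples, z_dim, img_shape=(1, 28, 28)):
    """3-step VITAE-style generation: sample, decode, spatially transform."""
    # 1. Sample both latent codes from the standard Gaussian prior.
    z_A = torch.randn(num_samples, z_dim)
    z_P = torch.randn(num_samples, z_dim)

    # 2. Decode appearance into an image x_tilde and perspective into
    #    affine parameters theta (one 2x3 matrix per sample).
    x_tilde = dec_appearance(z_A).view(num_samples, *img_shape)
    theta = dec_perspective(z_P).view(num_samples, 2, 3)

    # 3. Apply the spatial transformer layer: x = T_theta(x_tilde).
    grid = F.affine_grid(theta, x_tilde.shape, align_corners=False)
    return F.grid_sample(x_tilde, grid, align_corners=False)
```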

Unconditional VITAE inference. As the marginal likelihood (4) is intractable, we use variational inference. A natural choice is to approximate each latent group of factors z_A, z_P independently of the other, i.e.

p(z_P \mid x) \approx q_P(z_P \mid x) \quad \text{and} \quad p(z_A \mid x) \approx q_A(z_A \mid x).   (5)

The combined inference and generative model is illustrated in Fig. 3(b). For comparison, a VAE model is shown in Fig. 3(a). It can easily be shown that the ELBO for this model is merely the VAE bound with a KL-term for each latent space (see supplements).
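The excerpt defers the explicit form of that bound to the supplements; for the factorized approximation in Eq. (5), the standard derivation gives the following form (our reconstruction, written to match the notation of Eq. (7) below):

\log p(x) \;\geq\; \mathbb{E}_{q_A(z_A \mid x)\, q_P(z_P \mid x)}\big[\log p(x \mid z_A, z_P)\big] \;-\; \mathrm{KL}\big(q_A(z_A \mid x) \,\|\, p(z_A)\big) \;-\; \mathrm{KL}\big(q_P(z_P \mid x) \,\|\, p(z_P)\big),

i.e. the usual VAE bound with one KL-term per latent space.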

Conditional VITAE inference. This inference model does not mimic the generative process of the model, which may be suboptimal. Intuitively, we expect the encoder to approximately perform the inverse operation of the decoder, i.e. z ≈ encoder(decoder(z)) ≈ decoder⁻¹(decoder(z)). Since the proposed encoder (5) does not include an ST-layer, it may be difficult to train an encoder to approximately invert the decoder. To accommodate this, we first include an ST-layer in the encoder for the appearance factors. Secondly, we explicitly enforce that the predicted transformation in the encoder, T_θe, is the inverse of that of the decoder, T_θd, i.e. T_θe = (T_θd)⁻¹ (more on invertibility in Sec. 3.2). The inference of appearance is now dependent on the perspective factor z_P, i.e.

p(z_P \mid x) \approx q_P(z_P \mid x) \quad \text{and} \quad p(z_A \mid x) \approx q_A(z_A \mid x, z_P).   (6)

These changes to the inference architecture are illustrated in Fig. 3(c). It can easily be shown that the ELBO for this model is given by

\log p(x) \geq \mathbb{E}_{q_A, q_P}\big[\log p(x \mid z_A, z_P)\big] - D_{\mathrm{KL}}\big(q_P(z_P \mid x) \,\|\, p(z_P)\big) - \mathbb{E}_{q_P}\Big[D_{\mathrm{KL}}\big(q_A(z_A \mid x, z_P) \,\|\, p(z_A)\big)\Big],   (7)

which resembles the standard ELBO with an additional term corresponding to the second latent space (derivation in the supplementary material). We will call both models variationally inferred transformational autoencoders (VITAE), and we will denote the first model (5) as unconditional/U-VITAE and the second model (6) as conditional/C-VITAE. The naming comes from Eqs. 5 and 6, where z_A is respectively unconditioned and conditioned on z_P. Experiments will show that the conditional architecture is essential for inference (Sec. 4.2).
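The following PyTorch sketch illustrates the C-VITAE inference path and a one-sample Monte Carlo estimate of the bound in Eq. (7): the perspective is encoded first, the input is aligned with the inverse ST-layer, and the appearance is encoded from that aligned image. All module names are hypothetical placeholders, and the inverse transform assumes the matrix-exponential parametrization of Sec. 3.2 (so that T_{-θ} inverts T_θ); this is a sketch under those assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def apply_affine(x, theta_vel):
    """Apply T_theta parametrized by its velocity: expm of a 3x3 matrix with zero last row."""
    N = x.size(0)
    V = torch.zeros(N, 3, 3, device=x.device)
    V[:, :2, :] = theta_vel.view(N, 2, 3)
    A = torch.matrix_exp(V)[:, :2, :]            # 2x3 affine matrix per sample
    grid = F.affine_grid(A, x.shape, align_corners=False)
    return F.grid_sample(x, grid, align_corners=False)

def c_vitae_negative_elbo(x, enc_P, dec_theta, enc_A, dec_x):
    """One-sample estimate of the negative ELBO in Eq. (7)."""
    # Perspective factors: q_P(z_P | x)
    mu_P, logvar_P = enc_P(x)
    z_P = mu_P + torch.exp(0.5 * logvar_P) * torch.randn_like(mu_P)
    theta = dec_theta(z_P)                       # transformation parameters (velocity)

    # Appearance factors: q_A(z_A | x, z_P), conditioning via the inverse ST-layer
    x_aligned = apply_affine(x, -theta)          # T_theta_e = (T_theta_d)^{-1} = T_{-theta}
    mu_A, logvar_A = enc_A(x_aligned)
    z_A = mu_A + torch.exp(0.5 * logvar_A) * torch.randn_like(mu_A)

    # Generative path: decode appearance, then transform it back
    x_tilde = dec_x(z_A).view_as(x)
    x_recon_logits = apply_affine(x_tilde, theta)

    recon = F.binary_cross_entropy_with_logits(x_recon_logits, x, reduction="sum")
    kl = lambda mu, lv: -0.5 * torch.sum(1 + lv - mu.pow(2) - lv.exp())
    return recon + kl(mu_P, logvar_P) + kl(mu_A, logvar_A)
```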

3.2 Transformation classes

Until now, we have assumed that there exists a class of transformations T_θ that captures the perspective factors in data. Clearly, the choice of T depends on the true factors underlying the data, but in many cases an affine transformation should suffice,

T_\theta(x) = Ax + b = \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \end{bmatrix} \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.   (8)

Figure 4: Random deformation field of an affine transformation (top) compared to a CPAB transformation (bottom). We clearly see that CPAB transformations offer a much more flexible and rich class of diffeomorphic transformations.

However, the C-VITAE model requires access to the inverse transformation T_θ⁻¹. The inverse of Eq. 8 is given by T_θ⁻¹(x) = A⁻¹(x − b), which only exists if A has a non-zero determinant. One easily verified approach to secure invertibility is to parametrize the transformation by two scale factors s_x, s_y, one rotation angle α, one shear parameter m and two translation parameters t_x, t_y:

T_\theta(x) = \begin{bmatrix} \cos\alpha & \sin\alpha \\ -\sin\alpha & \cos\alpha \end{bmatrix} \begin{bmatrix} 1 & m \\ 0 & 1 \end{bmatrix} \begin{bmatrix} s_x & 0 \\ 0 & s_y \end{bmatrix} x + \begin{bmatrix} t_x \\ t_y \end{bmatrix}.   (9)

In this case the inverse is trivially

T^{-1}_{(s_x, s_y, m, \alpha, t_x, t_y)}(x) = T_{(1/s_x,\; 1/s_y,\; -m,\; -\alpha,\; -t_x,\; -t_y)}(x),   (10)

where the scale factors must be strictly positive. An easier and more elegant approach is to leverage the matrix exponential. That is, instead of parametrizing the transformation as in Eq. 8, we parametrize the velocity of the transformation,

T_\theta(x) = \operatorname{expm}\!\left( \begin{bmatrix} \theta_{11} & \theta_{12} & \theta_{13} \\ \theta_{21} & \theta_{22} & \theta_{23} \\ 0 & 0 & 0 \end{bmatrix} \right) \begin{bmatrix} x \\ y \\ 1 \end{bmatrix}.   (11)

The inverse² is then T_θ⁻¹ = T_{−θ}. T_θ in Eq. 11 is then a C∞-diffeomorphism (i.e. a differentiable invertible map with a differentiable inverse) [Duistermaat and Kolk, 2000]. Experiments show that diffeomorphic transformations stabilize training and yield tighter ELBOs (see supplements).

² Follows from T_θ and T_{−θ} being commuting matrices.
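As a small numerical check of this construction (a sketch under our own naming conventions, not code from the paper), the matrix-exponential parametrization of Eq. (11) can be verified to invert exactly by negating the velocity:

```python
import torch

def affine_from_velocity(theta_vel):
    """Build the homogeneous matrix T_theta = expm(V) of Eq. (11).

    theta_vel: tensor of shape (6,) holding [v11, v12, v13, v21, v22, v23].
    """
    V = torch.zeros(3, 3)
    V[:2, :] = theta_vel.view(2, 3)      # last row stays zero
    return torch.matrix_exp(V)

# T_{-theta} is the exact inverse of T_theta, since V and -V commute.
theta = torch.randn(6)
T_fwd = affine_from_velocity(theta)
T_inv = affine_from_velocity(-theta)
assert torch.allclose(T_fwd @ T_inv, torch.eye(3), atol=1e-5)

# For the decomposed parametrization of Eq. (9), invertibility only requires
# strictly positive scales, and the inverse can always be computed as A^{-1}(x - b).
```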

Often we will not have prior knowledge regarding which transformation classes are suitable for disentangling the data. A natural way forward is then to apply a highly flexible class of transformations that are treated as a "black box". Inspired by Detlefsen et al. [2018], we also consider transformations T_θ using the highly expressive diffeomorphic CPAB transformations from Freifeld et al. [2015]. These can be viewed as an extension to Eq. 11: instead of having a single affine transformation parametrized by its velocity, the image domain is divided into smaller cells, each having their own affine velocity. The collection of local affine velocities can be eff
