Initialization of ReLUs for Dynamical Isometry

Rebekka Burkholz
Department of Biostatistics
Harvard T.H. Chan School of Public Health
655 Huntington Avenue, Boston, MA 02115

Alina Dubatovka
Department of Computer Science
ETH Zurich
Universitätstrasse 6, 8092 Zurich
alina.dubatovka@inf.ethz.ch

Abstract

Deep learning relies on good initialization schemes and hyperparameter choices prior to training a neural network. Random weight initializations induce random network ensembles, which give rise to the trainability, training speed, and sometimes also generalization ability of an instance. In addition, such ensembles provide theoretical insights into the space of candidate models of which one is selected during training. The results obtained so far rely on mean field approximations that assume infinite layer width and that study average squared signals. We derive the joint signal output distribution exactly, without mean field assumptions, for fully-connected networks with Gaussian weights and biases, and analyze deviations from the mean field results. For rectified linear units, we further discuss limitations of the standard initialization scheme, such as its lack of dynamical isometry, and propose a simple alternative that overcomes these by initial parameter sharing.

1 Introduction

Deep learning relies critically on good parameter initialization prior to training. Two approaches are commonly employed: random network initialization [4, 7, 14] and transfer learning [26] (including unsupervised pre-training), where a network that was trained for a different task, or a part of it, is retrained and extended by additional network layers.
While the latter can speed up training considerably and also improve the generalization ability of the new model, its bias towards the original task can also hinder successful training if the learned features barely relate to the new task. Random initialization of parameters, meanwhile, requires careful tuning of the distributions from which neural network weights and biases are drawn. While heterogeneity of network parameters is needed to produce meaningful output, too large a variance can also dilute the original signal. To avoid exploding or vanishing gradients, the distributions can be adjusted to preserve the signal variance from layer to layer. This enables the training of very deep networks by simple stochastic gradient descent (SGD) without the need for computationally intensive corrections such as batch normalization [8] or variants thereof [12]. This approach is justified by the similar update rules of gradient back-propagation and signal forward propagation [20]. In addition to trainability, good parameter initializations also seem to support the generalization ability of the trained, overparametrized network. According to [3], the parameter values remain close to the initialized ones, which has a regularization effect. An early example of approximate signal variance preservation was proposed in [4] for fully-connected feed-forward neural networks, an important building block of most common neural architectures.
Inspired by those derivations, He et al. [7] found that for rectified linear units (ReLUs) and Gaussian weight initialization $w \sim \mathcal{N}(\mu, \sigma^2)$, the optimal choice is zero mean $\mu = 0$, variance $\sigma^2 = 2/N$, and zero bias $b = 0$, where $N$ refers to the number of neurons in a layer. These findings are confirmed by mean field theory, which assumes infinitely wide network layers to employ the central limit theorem and focus on normal distributions. Similar results have been obtained for tanh [16, 18, 20], residual networks with different activation functions [24], and convolutional neural networks [23]. The same derivations also lead to the insight that infinitely wide fully-connected neural networks approximately learn the kernel of a Gaussian process [11]. According to these works, not only the signal variance but also correlations between signals corresponding to different inputs need to be preserved to ensure good trainability of initialized neural networks. This way, the average eigenvalue of the signal input-output Jacobian in mean field neural networks is steered towards 1. Furthermore, a high concentration of the full spectral density of the Jacobian close to 1 seems to support higher training speeds [14, 15]. This property is called dynamical isometry and is better realized by orthogonal weight initializations [19]. So far, these insights rely on the mean field assumption of infinite layer width. [6, 5] have derived finite size corrections for the average squared signal norm and answered the question of when the mean field assumption holds.
In this article, we determine the exact signal output distribution without requiring mean field approximations. For fully-connected network ensembles with Gaussian weights and biases and general nonlinear activation functions, we find that the output distribution depends only on the scalar products between different inputs. We therefore focus on their propagation through a network ensemble. In particular, we study a linear transition operator that advances the signal distribution layer-wise. We conjecture that the spectral properties of this operator can be more informative of trainability than the average spectral density of the input-output Jacobian. Additionally, the distribution of the cosine similarity indicates how well an initialized network can distinguish different inputs. We further discuss when network layers of finite width are well represented by mean field analysis and when they are not. Furthermore, we highlight important differences in the analysis. By specializing our derivations to ReLUs, we find variants of the He initialization [7] that fulfill the same criteria but also suffer from the same lack of dynamical isometry [14]. In consequence, such initialized neural networks cannot be trained effectively without batch normalization at high depth. To overcome this problem, we propose a simple initialization scheme for ReLU layers that guarantees perfect dynamical isometry. A subset of the weights can still be drawn from Gaussian distributions or chosen as orthogonal, while the remaining ones are designed to ensure full signal propagation. Both variants consistently outperform the He initialization in our experiments on MNIST and CIFAR-10.

2 Signal propagation through Gaussian neural network ensembles

2.1 Background and notation

We study fully-connected neural network ensembles with zero mean Gaussian weights and biases. We thus make the following assumption:
An ensemble $\mathcal{G}_{L, N_l, \sigma_w, \sigma_b}$ of fully-connected feed-forward neural networks consists of networks with depth $L$, widths $N_l$, $l = 0, \dots, L$, independently normally distributed weights and biases with $w^{(l)}_{ij} \sim \mathcal{N}(0, \sigma^2_{w,l})$ and $b^{(l)}_i \sim \mathcal{N}(0, \sigma^2_{b,l})$, and a non-decreasing activation function $\phi: \mathbb{R} \to \mathbb{R}$.

Starting from the input vector $x^{(0)}$, the signal $x^{(l)}$ propagates through the network, as usual, as
$$x^{(l)} = \phi\big(h^{(l)}\big), \quad h^{(l)} = W^{(l)} x^{(l-1)} + b^{(l)}, \qquad x^{(l)}_i = \phi\big(h^{(l)}_i\big), \quad h^{(l)}_i = \sum_{j=1}^{N_{l-1}} w^{(l)}_{ij} x^{(l-1)}_j + b^{(l)}_i,$$
for $l = 1, \dots, L$, where $h^{(l)}$ is the pre-activation at layer $l$, $W^{(l)}$ is the weight matrix, and $b^{(l)}$ is the bias vector. If not indicated otherwise, 1-dimensional functions applied to vectors are applied to each component separately. To ease notation, we follow the convention of suppressing the superscript $(l)$ and write, for instance, $x_i$ instead of $x^{(l)}_i$ (with analogous shorthand for the previous-layer signal $x^{(l-1)}_i$ and the next-layer signal $x^{(l+1)}_i$) when the layer reference is clear from the context.
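For concreteness, the following minimal NumPy sketch (not taken from the paper's code; the layer widths and the He-style scaling $\sigma_{w,l}^2 = 2/N_{l-1}$ are illustrative assumptions) samples one network from such a Gaussian ensemble and propagates an input through it:

```python
import numpy as np

def sample_network(widths, sigma_w, sigma_b, rng):
    """Draw W^(l), b^(l) for l = 1..L with entries N(0, sigma_{w,l}^2) and N(0, sigma_{b,l}^2)."""
    params = []
    for l in range(1, len(widths)):
        W = rng.normal(0.0, sigma_w[l - 1], size=(widths[l], widths[l - 1]))
        b = rng.normal(0.0, sigma_b[l - 1], size=widths[l])
        params.append((W, b))
    return params

def forward(params, x, phi):
    """Signal propagation x^(l) = phi(W^(l) x^(l-1) + b^(l))."""
    for W, b in params:
        x = phi(W @ x + b)
    return x

relu = lambda h: np.maximum(h, 0.0)
rng = np.random.default_rng(0)
widths = [784, 200, 200, 10]                       # N_0, ..., N_L (illustrative)
sigma_w = [np.sqrt(2.0 / n) for n in widths[:-1]]  # He-style scaling: sigma_w^2 = 2 / fan-in
sigma_b = [0.0] * (len(widths) - 1)
net = sample_network(widths, sigma_w, sigma_b, rng)
out = forward(net, rng.normal(size=widths[0]), relu)
```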
Ideally, the initialized network is close to the trained one with high probability and can be reached in a small number of training steps. Hence, our first goal is to understand the ensemble above and the trainability of an initialized network without requiring mean field approximations of infinite $N_l$. In particular, we derive the probability distribution of the output $x^{(L)}$. Within this framework, our second goal is to learn how to improve on the He initialization, i.e., the choice $\sigma_{w,l} = \sqrt{2/N_l}$ and $b^{(l)}_i = 0$. Even though it preserves the variance for ReLUs, i.e., $\phi(x) = \max\{0, x\}$, as activation functions [7], neither this parameter choice nor orthogonal weights lead to dynamical isometry [14]. Thus, the average spectrum of the input-output Jacobian is not concentrated around 1 for higher depths and infinite width. In consequence, ReLUs are argued to be an inferior choice compared to sigmoids [14]. Our third goal is therefore to provide an initialization scheme for ReLUs that overcomes the resulting problems and provides dynamical isometry. We start with our results on signal propagation for general activation functions. The proofs of all theorems are given in the supplementary material. As we show, the signal output distribution depends on the input distribution only via scalar products of the inputs. Higher order terms do not propagate through a network ensemble at initialization. In consequence, we can focus on the distribution of such scalar products later on to derive meaningful criteria for the trainability of initialized deep neural networks.

2.2 General activation functions
Let us first assume that the signal $x$ of the previous layer is given. Then, each pre-activation component $h_i$ of the current layer is normally distributed as
$$h_i = \sum_{j=1}^{N_{l-1}} w_{ij} x_j + b_i \sim \mathcal{N}\Big(0,\; \sigma_w^2 \sum_j x_j^2 + \sigma_b^2\Big),$$
since the weights and bias are independently normally distributed with zero mean. The non-linear, monotonically increasing transformation $x_i = \phi(h_i)$ is distributed with cdf $\Phi\big(\phi^{-1}(\cdot)/\sigma\big)$, where $\phi^{-1}$ denotes the generalized inverse of $\phi$, i.e. $\phi^{-1}(x) := \inf\{y \in \mathbb{R} \mid \phi(y) \geq x\}$, $\Phi$ the cumulative distribution function (cdf) of a standard normal random variable, and $\sigma^2 = \sigma_w^2 \|x\|^2 + \sigma_b^2$. Thus, we only need to know the distribution of $\|x\|^2$ as input to compute the distribution of $x_i$; the signal propagation is thus reduced to a 1-dimensional problem. Note that the assumption of equal $\sigma_w^2$ for all incoming edges into a neuron is crucial for this result. Otherwise, $h_i \sim \mathcal{N}\big(0, \sum_j \sigma_{w,j}^2 x_j^2 + \sigma_{b,i}^2\big)$ would require knowledge of the distribution of $\sum_j \sigma_{w,j}^2 x_j^2$, which depends on the parameters $\sigma_{w,j}^2$. Based on $\sigma_{w,j}^2 = \sigma_w^2$, however, we can compute the probability distribution of outputs.
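This one-dimensional reduction is easy to check empirically. The following sketch (illustrative assumptions only, not the authors' code) fixes a previous-layer signal $x$, repeatedly draws a weight row and bias for a single ReLU unit, and compares the empirical distribution of $h_i$ and $x_i = \phi(h_i)$ with the predictions above; note that for ReLU and $t > 0$, $\phi^{-1}(t) = t$:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
N_prev, sigma_w, sigma_b = 200, np.sqrt(2.0 / 200), 0.0     # illustrative parameters
x = rng.normal(size=N_prev)                                  # fixed previous-layer signal

# Many independent draws of one unit's weight row and bias: h = <w, x> + b, x_i = relu(h).
n_samples = 200_000
w = rng.normal(0.0, sigma_w, size=(n_samples, N_prev))
b = rng.normal(0.0, sigma_b, size=n_samples)
h = w @ x + b
xi = np.maximum(h, 0.0)

sigma = np.sqrt(sigma_w**2 * np.sum(x**2) + sigma_b**2)
print("std of h: empirical", h.std(), "vs. predicted", sigma)

# Empirical cdf of x_i against Phi(phi^{-1}(t) / sigma) = Phi(t / sigma) for t > 0.
for t in (0.2, 0.5, 1.0, 2.0):
    print(t, (xi <= t).mean(), norm.cdf(t / sigma))
```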
Proposition 1. Let the probability density $p_0(z)$ of the squared input norm $\|x^{(0)}\|^2 = \sum_{i=1}^{N_0} \big(x^{(0)}_i\big)^2$ be known. Then, the distribution $p_l(z)$ of the squared signal norm $\|x^{(l)}\|^2$ depends only on the distribution $p_{l-1}(z)$ of the previous layer, as transformed by a linear operator $T_l: L^1(\mathbb{R}_+) \to L^1(\mathbb{R}_+)$, so that $p_l = T_l(p_{l-1})$. $T_l$ is defined as
$$T_l(p)(z) = \int_0^\infty k_l(y, z)\, p(y)\, dy, \qquad (1)$$
where $k_l(y, z)$ is the distribution of the squared signal $z$ at layer $l$ given the squared signal $y$ at the previous layer, so that $k_l(y, z) = p^{*N_l}_{\phi(h_y)^2}(z)$, where $*$ stands for convolution and $p_{\phi(h_y)^2}(z)$ denotes the distribution of the squared transformed pre-activation $\phi(h_y)^2$, with $h_y \sim \mathcal{N}(0, \sigma_w^2 y + \sigma_b^2)$. This distribution serves to compute the cumulative distribution function (cdf) of each signal component $x^{(l)}_i$ as
$$F_{x^{(l)}_i}(x) = \int_0^\infty dz\; p_{l-1}(z)\, \Phi\!\left(\frac{\phi^{-1}(x)}{\sqrt{\sigma_w^2 z + \sigma_b^2}}\right), \qquad (2)$$
where $\phi^{-1}$ denotes the generalized inverse of $\phi$ and $\Phi$ the cdf of a standard normal random variable. Accordingly, the components are jointly distributed as
$$F_{x^{(l)}_1, \dots, x^{(l)}_{N_l}}(x) = \int_0^\infty dz\; p_{l-1}(z) \prod_{i=1}^{N_l} \Phi\!\left(\frac{\phi^{-1}(x_i)}{\sigma_z}\right), \qquad (3)$$
where we use the abbreviation $\sigma_z = \sqrt{\sigma_w^2 z + \sigma_b^2}$. As common, the $N$-fold convolution of a function $f \in L^1(\mathbb{R}_+)$ is defined as repeated convolution with $f$, i.e., by induction, $f^{*N}(z) = f * f^{*(N-1)}(z) = \int_0^z f(x)\, f^{*(N-1)}(z - x)\, dx$.
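To make Eq. (2) concrete, the following sketch evaluates the marginal cdf of a single ReLU component by simple quadrature over an assumed density $p_{l-1}$ on an equidistant grid; the density, grid, and parameters are illustrative assumptions, and for ReLU with $x > 0$ one has $\phi^{-1}(x) = x$:

```python
import numpy as np
from scipy.stats import norm, gamma

# Equidistant grid for the squared signal norm z of the previous layer.
z = np.linspace(1e-6, 50.0, 2000)
dz = z[1] - z[0]
p_prev = gamma.pdf(z, a=5.0, scale=2.0)      # assumed p_{l-1}(z); any density on R_+ works
p_prev /= p_prev.sum() * dz                  # normalize on the grid

sigma_w2, sigma_b2 = 2.0 / 200, 0.0          # illustrative He-style parameters

def cdf_component(x):
    """Eq. (2) for ReLU, x > 0: F(x) = int p_{l-1}(z) * Phi(x / sqrt(sigma_w^2 z + sigma_b^2)) dz."""
    sig = np.sqrt(sigma_w2 * z + sigma_b2)
    return np.sum(p_prev * norm.cdf(x / sig)) * dz

print([round(cdf_component(x), 4) for x in (0.1, 0.3, 1.0)])
```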
In Prop. 1, we note the radial symmetry of the output distribution: it only depends on the squared norm of the input. For a single input $x^{(0)}$, $p_0(z)$ is given by the indicator function $p_0(z) = 1_{\|x^{(0)}\|^2}(z)$.

[Figure 1: Layer-wise transition of the squared signal norm distribution for ReLUs with He initialization parameters $\sigma_w = \sqrt{2/N_l}$, $\sigma_b = 0$. (a) Squared signal norm distribution at different depths $L = 0, \dots, 9$ for $N_l = 200$; the initial distribution ($L = 0$) is defined by MNIST. (b) Eigenvalues corresponding to eigenfunctions $y_m$ of $T_l$ for $N_l = 10$ (black circles), $N_l = 20$ (blue triangles), and $N_l = 100$ (red +).]
Interestingly, mean field analysis also focuses on the average of the squared signal, which is likewise updated layer-wise. Prop. 1 explains and justifies the focus of mean field theory on the squared signal norm: more information is not transmitted from layer to layer to determine the state (distribution) of a single neuron. The difference to mean field theory here is that we regard the full distribution $p_{l-1}$ of the previous layer instead of only its average for infinitely large layers. The linear operator $T_l$ governs this distribution: $p_{x^{(L)}} = \prod_{l=1}^{L} T_l\, p_{x^{(0)}}$, where the product is defined by function composition. Hence, the linear operator $\prod_{l=1}^{L} T_l$ can also be interpreted as the Jacobian corresponding to the (linear) function that maps the squared input norm distribution to the squared output norm distribution. $T_l$ is different from the signal input-output Jacobian studied in mean field random matrix theory, yet its spectral properties can also inform us about the trainability of the network ensemble. Conveniently, we only have to study one spectrum and not a distribution of eigenvalues that are potentially coupled as in random matrix theory. For any nonlinear activation function, $T_l$ can be approximated numerically on an equidistant grid. The convolution in the kernel definition can be computed efficiently with the help of Fast Fourier Transforms. The eigenvalues of the matrix approximating $T_l$ define the approximate signal propagation along the eigendirections.
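As an illustration of such a discretization, the following sketch builds a matrix approximation of $T_l$ for ReLU with He-type parameters on an equidistant grid and inspects its leading eigenvalues (cf. Figure 1b). For brevity it estimates each kernel column by Monte Carlo sampling of $z = \sum_i \phi(h_i)^2$ instead of the FFT-based convolution mentioned above; the grid, width, and sample sizes are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)
N_l = 20                                   # layer width (cf. Figure 1b)
sigma_w2, sigma_b2 = 2.0 / N_l, 0.0        # He-type parameters
z_grid = np.linspace(0.0, 60.0, 301)       # equidistant grid for the squared norm
dz = z_grid[1] - z_grid[0]
n_mc = 10_000

# Column y of T: density of z = sum_i relu(h_i)^2 with h_i ~ N(0, sigma_w^2 * y + sigma_b^2).
# Note: probability mass beyond the grid is truncated by this rough discretization.
T = np.zeros((len(z_grid), len(z_grid)))
for col, y in enumerate(z_grid):
    h = rng.normal(0.0, np.sqrt(sigma_w2 * y + sigma_b2), size=(n_mc, N_l))
    zs = (np.maximum(h, 0.0) ** 2).sum(axis=1)
    hist, _ = np.histogram(zs, bins=len(z_grid), range=(0.0, z_grid[-1] + dz), density=True)
    T[:, col] = hist

eigvals = np.sort(np.linalg.eigvals(T * dz).real)[::-1]
print("leading eigenvalues of the discretized T_l:", eigvals[:5])
```

Comparing such spectra across different widths $N_l$ mirrors the comparison shown in Figure 1b.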
However, we only receive the full picture when we extend our study to look at the joint output distribution, i.e., the outputs corresponding to different inputs.

Proposition 2. The same component of the pre-activations $h_1, \dots, h_D$ corresponding to different inputs $x^{(0)}_1, \dots, x^{(0)}_D$ is jointly normally distributed with zero mean and covariance matrix $V$ defined by
$$v_{ij} = \mathrm{Cov}(h_i, h_j) = \sigma_w^2 \langle x_i, x_j \rangle + \sigma_b^2 \qquad (4)$$
for $i, j = 1, \dots, D$, conditional on the signals $x_i$ of the previous layer.
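As a small illustration of Eq. (4) (with assumed previous-layer signals, not data from the paper), one can form $V$ directly and draw the corresponding jointly Gaussian pre-activation component; drawing a shared weight row and bias for all inputs induces exactly this covariance over repeated draws:

```python
import numpy as np

rng = np.random.default_rng(3)
D, N_prev = 4, 200
sigma_w2, sigma_b2 = 2.0 / N_prev, 0.0       # illustrative parameters
X = rng.normal(size=(D, N_prev))             # assumed previous-layer signals x_1, ..., x_D

# Covariance matrix V of the same pre-activation component across the D inputs, Eq. (4).
V = sigma_w2 * (X @ X.T) + sigma_b2

# One joint draw of that component for all D inputs.
h = rng.multivariate_normal(mean=np.zeros(D), cov=V)

# Equivalent direct construction: a single shared weight row and bias applied to all inputs;
# over repeated draws of (w, b), the covariance of h_direct equals V.
w = rng.normal(0.0, np.sqrt(sigma_w2), size=N_prev)
b = rng.normal(0.0, np.sqrt(sigma_b2))
h_direct = X @ w + b
print(np.round(V, 3), h, h_direct)
```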