Neural Networks for Machine Learning
Lecture 12a: The Boltzmann Machine learning algorithm

The goal of learning
We want to maximize the product of the probabilities that the Boltzmann machine assigns to the binary vectors in the training set. This is equivalent to maximizing the sum of the log probabilities that the Boltzmann machine assigns to the training vectors. It is also equivalent to maximizing the probability that we would obtain exactly the N training cases if we did the following: let the network settle to its stationary distribution N different times with no external input, and sample the visible vector once each time.
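In symbols (a standard restatement of the objective; the notation $\theta$ for all weights and biases and $\mathbf{v}^{(n)}$ for the n-th training vector is ours, not from the slides):

$$\theta^{*} = \arg\max_{\theta} \prod_{n=1}^{N} p_{\theta}\!\left(\mathbf{v}^{(n)}\right) = \arg\max_{\theta} \sum_{n=1}^{N} \log p_{\theta}\!\left(\mathbf{v}^{(n)}\right)$$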
Why the learning could be difficult
Consider a chain of units with visible units at the ends. If the training set consists of (1,0) and (0,1), we want the product of all the weights to be negative. So to know how to change w1 or w5 we must know w3.

[Figure: a chain of units with visible units at the two ends and hidden units in between, connected by weights w1, w2, w3, w4, w5.]

A very surprising fact
Everything that one weight needs to know about the other weights and the data is contained in the difference of two correlations:

$$\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \langle s_i s_j \rangle_{\mathbf{v}} - \langle s_i s_j \rangle_{\text{model}}$$

The left-hand side is the derivative of the log probability of one training vector, v, under the model. The first term is the expected value of the product of states at thermal equilibrium when v is clamped on the visible units; the second term is the expected value of the product of states at thermal equilibrium with no clamping.

Why is the derivative so simple?
The energy is a linear function of the weights and states, so $-\partial E / \partial w_{ij} = s_i s_j$. The process of settling to thermal equilibrium propagates information about the weights, so we don't need backprop. The probability of a global configuration at thermal equilibrium is an exponential function of its energy, so settling to equilibrium makes the log probability a linear function of the energy.
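A short derivation sketch filling in the step behind this (the standard Boltzmann machine derivation at temperature 1; the bias term $b_i$ and the summation over joint configurations are the usual conventions, not spelled out in the extract):

$$E(\mathbf{v},\mathbf{h}) = -\sum_i s_i b_i - \sum_{i<j} s_i s_j w_{ij}, \qquad p(\mathbf{v}) = \frac{\sum_{\mathbf{h}} e^{-E(\mathbf{v},\mathbf{h})}}{\sum_{\mathbf{u},\mathbf{g}} e^{-E(\mathbf{u},\mathbf{g})}}$$

$$\frac{\partial \log p(\mathbf{v})}{\partial w_{ij}} = \sum_{\mathbf{h}} p(\mathbf{h}\mid\mathbf{v})\, s_i s_j \;-\; \sum_{\mathbf{u},\mathbf{g}} p(\mathbf{u},\mathbf{g})\, s_i s_j = \langle s_i s_j \rangle_{\mathbf{v}} - \langle s_i s_j \rangle_{\text{model}}$$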
Why do we need the negative phase?
The positive phase finds hidden configurations that work well with v and lowers their energies. The negative phase finds the joint configurations that are the best competitors and raises their energies.

An inefficient way to collect the statistics required for learning (Hinton and Sejnowski, 1983)
Positive phase: Clamp a data vector on the visible units and set the hidden units to random binary states. Update the hidden units one at a time until the network reaches thermal equilibrium at a temperature of 1. Sample $\langle s_i s_j \rangle$ for every connected pair of units. Repeat for all data vectors in the training set and average.
Negative phase: Set all the units to random binary states. Update all the units one at a time until the network reaches thermal equilibrium at a temperature of 1. Sample $\langle s_i s_j \rangle$ for every connected pair of units. Repeat many times (how many?) and average to get good estimates.
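A minimal numpy sketch of this procedure for a fully connected binary Boltzmann machine (the helper names, the symmetric weight matrix W with zero diagonal, the bias vector b, and the number of settling sweeps are illustrative assumptions, not values from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def settle(state, W, b, clamped, n_sweeps=50, rng=np.random):
    """Sequential stochastic updates at temperature 1; indices in `clamped` stay fixed."""
    free = [i for i in range(len(state)) if i not in clamped]
    for _ in range(n_sweeps):
        for i in free:
            p_on = sigmoid(b[i] + W[i] @ state)   # W[i, i] is assumed to be 0
            state[i] = float(rng.random() < p_on)
    return state

def hinton_sejnowski_statistics(data, W, b, n_hidden, n_fantasy=20, rng=np.random):
    """Estimate <s_i s_j>_data and <s_i s_j>_model by settling to (approximate) equilibrium."""
    n_vis = data.shape[1]
    n_units = n_vis + n_hidden
    # Positive phase: clamp each data vector, settle the hidden units, then sample s_i * s_j.
    pos = np.zeros((n_units, n_units))
    for v in data:
        s = np.concatenate([v, (rng.random(n_hidden) < 0.5).astype(float)])
        s = settle(s, W, b, clamped=set(range(n_vis)), rng=rng)
        pos += np.outer(s, s)
    pos /= len(data)
    # Negative phase: start from random global states, settle everything, then sample s_i * s_j.
    neg = np.zeros((n_units, n_units))
    for _ in range(n_fantasy):
        s = (rng.random(n_units) < 0.5).astype(float)
        s = settle(s, W, b, clamped=set(), rng=rng)
        neg += np.outer(s, s)
    neg /= n_fantasy
    return pos, neg   # the learning rule is then delta_W = epsilon * (pos - neg)
```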
Neural Networks for Machine Learning
Lecture 12b: More efficient ways to get the statistics
(Advanced material: not on quizzes or the final test.)

A better way of collecting the statistics
If we start from a random state, it may take a long time to reach thermal equilibrium, and it is very hard to tell when we get there. Why not start from whatever state you ended up in last time you saw that datavector? This stored state is called a "particle". Using particles that persist to get a "warm start" has a big advantage: if we were at equilibrium last time and we only changed the weights a little, we should only need a few updates to get back to equilibrium.
Neal's method for collecting the statistics (Neal, 1992)
Positive phase: Keep a set of "data-specific particles", one per training case. Each particle has a current value that is a configuration of the hidden units. Sequentially update all the hidden units a few times in each particle with the relevant datavector clamped. For every connected pair of units, average $s_i s_j$ over all the data-specific particles.
Negative phase: Keep a set of "fantasy particles". Each particle has a value that is a global configuration. Sequentially update all the units in each fantasy particle a few times. For every connected pair of units, average $s_i s_j$ over all the fantasy particles.
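A sketch of how the two kinds of persistent particles might be maintained, reusing the `settle` helper from the earlier sketch (the particle initialisation and the number of update sweeps are assumptions):

```python
import numpy as np

class NealStatistics:
    """Persistent data-specific and fantasy particles for estimating the two correlations."""

    def __init__(self, data, n_hidden, n_fantasy=20, rng=None):
        self.rng = rng or np.random.default_rng(0)
        self.data = data
        self.n_vis = data.shape[1]
        n_units = self.n_vis + n_hidden
        # One data-specific particle (a hidden configuration) per training case.
        self.hidden_particles = (self.rng.random((len(data), n_hidden)) < 0.5).astype(float)
        # A small set of fantasy particles, each a full global configuration.
        self.fantasy_particles = (self.rng.random((n_fantasy, n_units)) < 0.5).astype(float)

    def update_and_measure(self, W, b, n_sweeps=5):
        # Positive phase: a few sequential sweeps per particle with its datavector clamped.
        pos = 0.0
        for n, v in enumerate(self.data):
            s = np.concatenate([v, self.hidden_particles[n]])
            s = settle(s, W, b, clamped=set(range(self.n_vis)), n_sweeps=n_sweeps, rng=self.rng)
            self.hidden_particles[n] = s[self.n_vis:]      # the particle persists between calls
            pos = pos + np.outer(s, s)
        pos = pos / len(self.data)
        # Negative phase: a few sequential sweeps per fantasy particle with nothing clamped.
        neg = 0.0
        for k, s in enumerate(self.fantasy_particles):
            s = settle(s, W, b, clamped=set(), n_sweeps=n_sweeps, rng=self.rng)
            self.fantasy_particles[k] = s
            neg = neg + np.outer(s, s)
        neg = neg / len(self.fantasy_particles)
        return pos, neg
```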
Adapting Neal's approach to handle mini-batches
Neal's approach does not work well with mini-batches: by the time we get back to the same datavector again, the weights will have been updated many times, but the data-specific particle will not have been updated, so it may be far from equilibrium. Instead we make a strong assumption about how we understand the world: when a datavector is clamped, we will assume that the set of good explanations (i.e. hidden unit states) is uni-modal. That is, we restrict ourselves to learning models in which one sensory input vector does not have multiple very different explanations.
The simple mean field approximation
If we want to get the statistics right, we need to update the units stochastically and sequentially. But if we are in a hurry, we can use probabilities instead of binary states and update the units in parallel. To avoid biphasic oscillations we can use damped mean field.
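A small sketch of the parallel mean-field update with damping, which replaces the stochastic state $s_i$ by a real-valued probability $p_i$ (the damping coefficient `lam` and the convergence tolerance are illustrative choices, not values from the lecture):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def damped_mean_field(p, W, b, lam=0.5, tol=1e-4, max_iters=200):
    """Parallel damped updates: p_i <- lam * p_i + (1 - lam) * sigmoid(b_i + sum_j W_ij p_j)."""
    for _ in range(max_iters):
        p_new = lam * p + (1.0 - lam) * sigmoid(b + W @ p)
        if np.max(np.abs(p_new - p)) < tol:
            return p_new
        p = p_new
    return p
```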
An efficient mini-batch learning procedure for Boltzmann Machines (Salakhutdinov & Hinton, 2012)
Positive phase: Initialize all the hidden probabilities at 0.5. Clamp a datavector on the visible units. Update all the hidden units in parallel until convergence, using mean field updates. After the net has converged, record $\langle s_i s_j \rangle$ for every connected pair of units and average this over all data in the mini-batch.
Negative phase: Keep a set of "fantasy particles". Each particle has a value that is a global configuration. Sequentially update all the units in each fantasy particle a few times. For every connected pair of units, average $s_i s_j$ over all the fantasy particles.
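A sketch of one mini-batch update that combines the two phases (it reuses `sigmoid` and `settle` from the earlier sketches; the learning rate, the number of mean-field iterations, and the number of stochastic sweeps are assumptions):

```python
import numpy as np

def boltzmann_minibatch_step(batch, W, b, fantasy, n_hidden, lr=0.001, rng=None):
    """One mini-batch update for a general Boltzmann machine: damped mean-field positive phase,
    persistent fantasy-particle negative phase."""
    rng = rng or np.random.default_rng(0)
    n_vis = batch.shape[1]
    n_units = n_vis + n_hidden
    pos = np.zeros((n_units, n_units))
    for v in batch:
        p = np.concatenate([v, np.full(n_hidden, 0.5)])   # hidden probabilities start at 0.5
        for _ in range(50):                                # damped parallel mean-field updates
            p = 0.5 * p + 0.5 * sigmoid(b + W @ p)
            p[:n_vis] = v                                  # keep the datavector clamped
        pos += np.outer(p, p)
    pos /= len(batch)
    neg = np.zeros((n_units, n_units))
    for k in range(len(fantasy)):                          # a few sequential stochastic sweeps
        fantasy[k] = settle(fantasy[k], W, b, clamped=set(), n_sweeps=5, rng=rng)
        neg += np.outer(fantasy[k], fantasy[k])
    neg /= len(fantasy)
    W = W + lr * (pos - neg)                               # difference of the two correlations
    np.fill_diagonal(W, 0.0)                               # no self-connections
    return W, fantasy
```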
Making the updates more parallel
In a general Boltzmann machine, the stochastic updates of units need to be sequential. There is a special architecture that allows alternating parallel updates which are much more efficient: no connections within a layer, and no skip-layer connections. This is called a Deep Boltzmann Machine (DBM). It's a general Boltzmann machine with a lot of missing connections.

[Figure: a DBM with a visible layer and several hidden layers stacked above it; connections run only between adjacent layers.]
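A sketch of the alternating parallel schedule this architecture allows: with no within-layer connections, all even-numbered layers can be updated in parallel given the odd-numbered ones, and vice versa (the layer list and weight shapes are illustrative assumptions):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def dbm_alternating_update(layers, weights, biases, rng):
    """One round of alternating parallel Gibbs updates for a DBM.
    layers[l] is the binary state of layer l; weights[l] connects layer l to layer l+1."""
    for parity in (0, 1):                       # first the even layers, then the odd layers
        for l in range(parity, len(layers), 2):
            total = biases[l].copy()
            if l > 0:                           # input from the layer below
                total += weights[l - 1].T @ layers[l - 1]
            if l < len(layers) - 1:             # input from the layer above
                total += weights[l] @ layers[l + 1]
            layers[l] = (rng.random(len(layers[l])) < sigmoid(total)).astype(float)
    return layers
```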
Can a DBM learn a good model of the MNIST digits?
Do samples from the model look like real data?

A puzzle
Why can we estimate the "negative phase statistics" well with only 100 negative examples to characterize the whole space of possible configurations? For all interesting problems the GLOBAL configuration space is highly multi-modal. How does it manage to find and represent all the modes with only 100 particles?

The learning raises the effective mixing rate
The learning interacts with the Markov chain that is being used to gather the "negative statistics" (i.e. the data-independent statistics). We cannot analyse the learning by viewing it as an outer loop and the gathering of statistics as an inner loop. Wherever the fantasy particles outnumber the positive data, the energy surface is raised. This makes the fantasies rush around hyperactively. They move around MUCH faster than the mixing rate of the Markov chain defined by the static current weights.

How fantasy particles move between the model's modes
If a mode has more fantasy particles than data, the energy surface is raised until the fantasy particles escape. This can overcome energy barriers that would be too high for the Markov chain to jump in a reasonable time. The energy surface is being changed to help mixing, in addition to defining the model. Once the fantasy particles have filled in a hole, they rush off somewhere else to deal with the next problem. They are like investigative journalists.

[Figure: an energy surface with a deep minimum; this minimum will get filled in by the learning until the fantasy particles escape.]
Neural Networks for Machine Learning
Lecture 12c: Restricted Boltzmann Machines

Restricted Boltzmann Machines
We restrict the connectivity to make inference and learning easier: only one layer of hidden units, and no connections between hidden units. In an RBM it only takes one step to reach thermal equilibrium when the visible units are clamped, so we can quickly get the exact value of $\langle v_i h_j \rangle_{\mathbf{v}}$.

[Figure: a bipartite graph with a layer of hidden units j above a layer of visible units i; connections run only between the two layers.]
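A sketch of that exact computation for a binary RBM: because the hidden units are conditionally independent given v, $p(h_j = 1 \mid \mathbf{v}) = \sigma(b_j + \sum_i v_i w_{ij})$ and $\langle v_i h_j \rangle_{\mathbf{v}} = v_i \, p(h_j = 1 \mid \mathbf{v})$. The weight-matrix shape and variable names below are assumptions for illustration:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def clamped_statistics(v, W, b_h):
    """Exact <v_i h_j> with the visible vector v clamped, for a binary RBM.
    W has shape (n_visible, n_hidden); b_h is the vector of hidden biases."""
    p_h = sigmoid(b_h + v @ W)        # p(h_j = 1 | v), computed independently for each j
    return np.outer(v, p_h)           # entry (i, j) is v_i * p(h_j = 1 | v)
```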
PCD: An efficient mini-batch learning procedure for Restricted Boltzmann Machines (Tieleman, 2008)
Positive phase: Clamp a datavector on the visible units. Compute the exact value of $\langle v_i h_j \rangle$ for all pairs of a visible and a hidden unit. For every connected pair of units, average over all data in the mini-batch.
Negative phase: Keep a set of "fantasy particles". Each particle has a value that is a global configuration. Update each fantasy particle a few times using alternating parallel updates. For every connected pair of units, average $v_i h_j$ over all the fantasy particles.
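A compact sketch of one PCD mini-batch update (it reuses `sigmoid` and `clamped_statistics` from the sketch above; the learning rate, the number of Gibbs updates per particle, and the omission of bias updates are simplifying assumptions):

```python
import numpy as np

def pcd_step(batch, W, b_v, b_h, fantasy_v, lr=0.01, n_gibbs=1, rng=None):
    """One PCD update for a binary RBM. fantasy_v holds the visible states of the fantasy
    particles and persists across calls, so the negative-phase chains are never restarted."""
    rng = rng or np.random.default_rng(0)
    # Positive phase: exact statistics with the data clamped.
    pos = np.mean([clamped_statistics(v, W, b_h) for v in batch], axis=0)
    # Negative phase: a few alternating parallel updates on the persistent fantasy particles.
    for _ in range(n_gibbs):
        p_h = sigmoid(b_h + fantasy_v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
        p_v = sigmoid(b_v + h @ W.T)
        fantasy_v = (rng.random(p_v.shape) < p_v).astype(float)
    p_h = sigmoid(b_h + fantasy_v @ W)
    neg = fantasy_v.T @ p_h / len(fantasy_v)
    # Weight update: difference of the two correlations.
    return W + lr * (pos - neg), fantasy_v
```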
A picture of an inefficient version of the Boltzmann machine learning algorithm for an RBM
Start with a training vector on the visible units. Then alternate between updating all the hidden units in parallel and updating all the visible units in parallel.

[Figure: the alternating Gibbs chain at t = 0, t = 1, t = 2, ..., t = infinity; the configuration at t = infinity is "a fantasy".]

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{\infty} \right)$$

Contrastive divergence: A very surprising short-cut
Start with a training vector on the visible units. Update all the hidden units in parallel. Update all the visible units in parallel to get a "reconstruction". Update the hidden units again.

$$\Delta w_{ij} = \varepsilon \left( \langle v_i h_j \rangle^{0} - \langle v_i h_j \rangle^{1} \right)$$

This is not following the gradient of the log likelihood. But it works well.

[Figure: the chain at t = 0 (data) and t = 1 (reconstruction).]
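A sketch of the CD-1 weight update for a binary RBM (same assumed shapes as above, and reusing `sigmoid`; using probabilities rather than samples for the final hidden statistics is a common practical choice, not something the extract specifies):

```python
import numpy as np

def cd1_step(batch, W, b_v, b_h, lr=0.01, rng=None):
    """One CD-1 update for a binary RBM: data -> hidden -> reconstruction -> hidden again."""
    rng = rng or np.random.default_rng(0)
    # Up-pass from the data.
    p_h0 = sigmoid(b_h + batch @ W)
    h0 = (rng.random(p_h0.shape) < p_h0).astype(float)
    # Down-pass to get the reconstruction, then one more up-pass.
    p_v1 = sigmoid(b_v + h0 @ W.T)
    v1 = (rng.random(p_v1.shape) < p_v1).astype(float)
    p_h1 = sigmoid(b_h + v1 @ W)
    # Positive statistics at t = 0, negative statistics at t = 1.
    pos = batch.T @ p_h0 / len(batch)
    neg = v1.T @ p_h1 / len(batch)
    return W + lr * (pos - neg)
```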
Why does the shortcut work?
If we start at the data, the Markov chain wanders away from the data and towards things that it likes more. We can see what direction it is wandering in after only a few steps. When we know the weights are bad, it is a waste of time to let it go all the way to equilibrium. All we need to do is lower the probability of the confabulations it produces after one full step and raise the probability of the data. Then it will stop wandering away. The learning cancels out once the confabulations and the data have the same distribution.
A picture of contrastive divergence learning
Change the weights to pull the energy down at the datapoint, and to pull the energy up at the reconstruction.

[Figure: an energy surface E over the space of global configurations; the energy is lowered at (datapoint + hidden(datapoint)) and raised at (reconstruction + hidden(reconstruction)).]

When does the shortcut fail?
We need to worry about regions of the data-space that the model likes but which are very far from any data. These low-energy holes cause the normalization term to be big, and we cannot sense them if we use the shortcut. Persistent particles would eventually fall into a hole, cause it to fill up, and then move on to another hole. A good compromise between speed and correctness is to start with small weights and use CD1 (i.e. use one full step to get the "negative data"). Once the weights grow, the Markov chain mixes more slowly, so we use CD3. Once the weights have grown more, we use CD10.
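Generalising the CD-1 sketch above to CD-k, with a toy schedule that raises k as training proceeds (the epoch thresholds are illustrative assumptions, not values from the lecture; `sigmoid` is reused from the earlier sketches and k is assumed to be at least 1):

```python
import numpy as np

def cdk_step(batch, W, b_v, b_h, k, lr=0.01, rng=None):
    """CD-k for a binary RBM: run k full steps of alternating Gibbs sampling from the data."""
    rng = rng or np.random.default_rng(0)
    p_h0 = sigmoid(b_h + batch @ W)
    pos = batch.T @ p_h0 / len(batch)
    h = (rng.random(p_h0.shape) < p_h0).astype(float)
    for _ in range(k):                                   # k full down-up steps
        p_v = sigmoid(b_v + h @ W.T)
        v = (rng.random(p_v.shape) < p_v).astype(float)
        p_h = sigmoid(b_h + v @ W)
        h = (rng.random(p_h.shape) < p_h).astype(float)
    neg = v.T @ p_h / len(batch)
    return W + lr * (pos - neg)

def k_for_epoch(epoch):
    """Toy schedule: CD1 while the weights are small, then CD3, then CD10."""
    return 1 if epoch < 10 else (3 if epoch < 20 else 10)
```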
Neural Networks for Machine Learning
Lecture 12d: An example of Contrastive Divergence Learning

How to learn a set of features that are good for reconstructing images of the digit 2
The network has a 16 x 16 pixel image as its visible layer and 50 binary neurons that learn features as its hidden layer. When the network is driven by the data (reality), increment the weights between an active pixel and an active feature; when it is driven by its own reconstruction (better than reality, as far as the model is concerned), decrement the weights between an active pixel and an active feature.

The weights of the 50 feature detectors
We start with small random weights to break symmetry. The final 50 x 256 weights: each neuron grabs a different feature.

[Figure: the weights of the 50 feature detectors at successive stages of learning.]

How well can we reconstruct digit images from the binary feature activations?
[Figure: two panels, each showing the data and the reconstruction from activated binary features. One shows a new test image from the digit class that the model was trained on; the other shows an image from an unfamiliar digit class, where the network tries to see every image as a 2.]

Some features learned in the first hidden layer of a model of all 10 digit classes using 500 hidden units.
[Figure: a grid of the learned feature detectors.]
Neural Networks for Machine Learning
Lecture 12e: RBMs for collaborative filtering

Collaborative filtering: The Netflix competition
You are given most of the ratings that half a million users gave to 18,000 movies on a scale from 1 to 5. Each user only rates a small fraction of the movies. You have to predict the ratings users gave to the held-out movies. If you win, you get $1,000,000.

Let's use a "language model"
The data is strings of triples of the form: User, Movie, rating.
U2 M1 5
U2 M3 1
U4 M1 4
U4 M3 ?
All we have to do is to predict the next "word" well and we will get …