版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡介
Chapter 6. Classification: Advanced Methods

- Bayesian Belief Networks
- Classification by Backpropagation
- Support Vector Machines
- Classification by Using Frequent Patterns
- Lazy Learners (or Learning from Your Neighbors)
- Other Classification Methods
- Additional Topics Regarding Classification
- Summary

(From the slide deck for Data Mining: Concepts and Techniques, Chapter 6)
Bayesian Belief Networks

- Bayesian belief networks (also known as Bayesian networks or probabilistic networks): graphical models that allow class-conditional independencies to be defined between subsets of variables
- A (directed acyclic) graphical model of causal relationships
  - Represents dependencies among variables
  - Gives a specification of the joint probability distribution
- Nodes: random variables; links: dependencies
- Example: X and Y are the parents of Z, and Y is the parent of P
  - There is no dependency between Z and P
  - Cycles are not allowed
Bayesian Belief Networks: An Example

[Figure: a network over FamilyHistory (FH), Smoker (S), LungCancer (LC), Emphysema, PositiveXRay, and Dyspnea, with FH and S as the parents of LC]

CPT: Conditional Probability Table for the variable LungCancer, showing the conditional probability for each possible combination of its parents' values:

          (FH, S)   (FH, ~S)   (~FH, S)   (~FH, ~S)
  LC        0.8       0.5        0.7        0.1
  ~LC       0.2       0.5        0.3        0.9

- From the CPTs, the probability of a particular combination of values (x1, ..., xn) of X is derived as P(x1, ..., xn) = Π_i P(xi | Parents(Xi))
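A minimal sketch of this factorization in Python. The CPT values come from the table above; the root-node priors and the helper layout are illustrative assumptions, not part of the slide:

```python
# Sketch: in a belief network the joint probability factorizes over CPTs:
# P(x1, ..., xn) = prod_i P(xi | Parents(Xi))

# CPT for LungCancer from the slide: P(LC = yes | FH, S)
cpt_lc = {
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}

def p_lc(lc: bool, fh: bool, s: bool) -> float:
    """P(LungCancer = lc | FamilyHistory = fh, Smoker = s)."""
    p_yes = cpt_lc[(fh, s)]
    return p_yes if lc else 1.0 - p_yes

# Hypothetical priors for the root nodes (not given on the slide)
p_fh, p_s = 0.3, 0.4

def joint(fh: bool, s: bool, lc: bool) -> float:
    """P(FH, S, LC) = P(FH) * P(S) * P(LC | FH, S)."""
    return ((p_fh if fh else 1 - p_fh) *
            (p_s if s else 1 - p_s) *
            p_lc(lc, fh, s))

print(joint(True, True, True))  # 0.3 * 0.4 * 0.8 = 0.096
```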
Training Bayesian Networks: Several Scenarios

- Scenario 1: Given both the network structure and all variables observable: compute only the CPT entries (see the counting sketch after this list)
- Scenario 2: Network structure known, some variables hidden: gradient descent (greedy hill-climbing), i.e., search for a solution along the steepest descent of a criterion function
  - Weights are initialized to random values
  - At each iteration, it moves towards what appears to be the best solution at the moment, without backtracking
  - Weights are updated at each iteration and converge to a local optimum
- Scenario 3: Network structure unknown, all variables observable: search through the model space to reconstruct the network topology
- Scenario 4: Unknown structure, hidden variables: no good algorithms are known for this purpose
- Reference: D. Heckerman. A Tutorial on Learning with Bayesian Networks. In Learning in Graphical Models, M. Jordan, ed. MIT Press, 1999.
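A minimal sketch of Scenario 1, where each CPT entry is simply a conditional relative frequency over the fully observed data. It assumes pandas; the toy data and column names mirror the example above and are illustrative:

```python
import pandas as pd

# Toy fully observed training data (illustrative values only)
df = pd.DataFrame({
    "FH": [1, 1, 0, 0, 1, 0, 0, 1],
    "S":  [1, 0, 1, 0, 1, 1, 0, 0],
    "LC": [1, 1, 1, 0, 0, 1, 0, 0],
})

# Maximum-likelihood CPT entry: P(LC = 1 | FH, S) = count(FH, S, LC=1) / count(FH, S)
cpt = df.groupby(["FH", "S"])["LC"].mean()
print(cpt)  # one conditional probability per combination of parent values
```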
Classification by Backpropagation

- Backpropagation: a neural network learning algorithm
- Started by psychologists and neurobiologists to develop and test computational analogues of neurons
- A neural network: a set of connected input/output units in which each connection has a weight associated with it
- Learning is performed by adjusting the weights so that the network can predict the correct class label of the input tuples
- Also referred to as connectionist learning, owing to the connections between units

Neural Networks as Classifiers

Weaknesses
- Long training time
- Many parameters that are typically determined empirically, e.g., the network structure
- Poor interpretability: it is difficult to interpret the symbolic meaning behind the learned weights and the "hidden units" in the network

Strengths
- High tolerance to noisy data
- Ability to classify untrained patterns
- Well suited for continuous-valued inputs and outputs
- Successful on a wide array of real-world data, e.g., hand-written character recognition
- Algorithms are inherently parallel
- Techniques have been developed to extract rules from trained neural networks

A Multilayer Feed-Forward Neural Network

[Figure: input vector X enters the input layer; weighted connections w_ij lead through a hidden layer to the output layer, which emits the output vector]
9、節(jié)香孺染醛雙數(shù)據(jù)挖掘概念與技術(shù)CHAPTER6-分類ClassAdvanced數(shù)據(jù)挖掘概念與技術(shù)CHAPTER6-分類ClassAdvanced9多層前饋神經(jīng)網(wǎng)絡(luò)網(wǎng)絡(luò)的輸入對(duì)應(yīng)于每個(gè)訓(xùn)練元組的測(cè)量屬性 輸入同時(shí)傳給稱作輸入層的單元加權(quán)后同時(shí)傳遞給隱藏層隱藏層的數(shù)目是任意的, 通常只有一個(gè)最后一個(gè)隱藏層的輸出權(quán)重后作為輸入傳遞給稱為輸出層,此處給出網(wǎng)絡(luò)的預(yù)測(cè)前饋feed-forward: 權(quán)重都不反饋到輸入單元或前一層的輸出單元從統(tǒng)計(jì)學(xué)觀點(diǎn), 網(wǎng)絡(luò)進(jìn)行一個(gè)非線性回歸;給定足夠的隱藏單元和訓(xùn)練數(shù)據(jù), 可以逼近任何函數(shù)銻卒攬薪樞摯雌坍?dāng)㈤w枷競(jìng)呵嗓拈擱鴕驟淮堆從催商譴碧蔫翼卸間蘑歲硝數(shù)據(jù)挖掘概念與技術(shù)
Defining a Network Topology

- Decide the network topology: the number of units in the input layer, the number of hidden layers (if > 1), the number of units in each hidden layer, and the number of units in the output layer
- Normalize the input values of each training tuple to [0.0, 1.0]
- Discrete-valued attributes may be recoded, with one input unit per possible value, initialized to 0
- Output: if more than two classes are involved, use one output unit per class
- Once a trained network fails to reach the required accuracy, retrain with a different network topology or a different set of initial values
11、術(shù)CHAPTER6-分類ClassAdvanced11反向傳播Backpropagation迭代地處理訓(xùn)練數(shù)據(jù) & 比較網(wǎng)絡(luò)的預(yù)測(cè)值和實(shí)際的目標(biāo)值對(duì)每個(gè)訓(xùn)練元組, 修改權(quán)重最小化目標(biāo)的預(yù)測(cè)值和實(shí)際值之間的mean squared error這種修改后向進(jìn)行: 從輸出層開始, 通過每個(gè)隱藏層直到第一個(gè)隱藏層步驟初始化權(quán)重為一個(gè)小的隨機(jī)數(shù), 以及偏倚 biases 向前傳播輸入 (應(yīng)用激勵(lì)函數(shù)) 向后傳播誤差 (更新權(quán)重和偏倚)停止條件 (當(dāng)誤差非常小, etc.)宜安貌噪術(shù)履夸魚燥謬終牛瞥搗鼎強(qiáng)仔鉀拜吊星潤汗勺彤羞哮薔軋杉邊榮數(shù)據(jù)挖掘概念與技術(shù)CHAPTER6-分類ClassAdvanced數(shù)據(jù)
A Neuron: A Hidden/Output Layer Unit

- An n-dimensional input vector x is mapped to the variable y by means of a nonlinear function mapping
- The inputs to the unit are the outputs of the previous layer; they are multiplied by their weights, summed, and the bias of the unit is added; a nonlinear activation function f is then applied:
  y = f(Σ_i w_i x_i + b), with weight vector w = (w_0, w_1, ..., w_n), inputs x_0, x_1, ..., x_n, and bias b

The Backpropagation Algorithm

[Algorithm slide: the pseudocode figure did not survive extraction; a sketch follows below]
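A minimal numpy sketch of the steps above — forward pass, backpropagated error terms, and weight/bias updates — for one sigmoid hidden layer and one sigmoid output unit trained on mean squared error. Layer sizes, learning rate, and the single training tuple are illustrative assumptions, not the book's pseudocode:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One hidden layer with 3 units, one sigmoid output unit (illustrative sizes)
W1, b1 = rng.normal(scale=0.1, size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(scale=0.1, size=(1, 3)), np.zeros(1)
lr = 0.5

x, y = np.array([0.2, 0.7]), np.array([1.0])  # one training tuple

for _ in range(1000):
    # Forward: weighted sum plus bias, then the activation, layer by layer
    h = sigmoid(W1 @ x + b1)
    o = sigmoid(W2 @ h + b2)
    # Backward: error terms for the output and hidden layers (MSE loss)
    err_o = (o - y) * o * (1 - o)          # delta at the output unit
    err_h = (W2.T @ err_o) * h * (1 - h)   # deltas at the hidden units
    # Update weights and biases in the backwards direction
    W2 -= lr * np.outer(err_o, h); b2 -= lr * err_o
    W1 -= lr * np.outer(err_h, x); b1 -= lr * err_h

print(float(o))  # approaches the target value 1.0
```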
Efficiency and Interpretability

- Efficiency of backpropagation: each epoch takes O(|D| × w) time, for |D| tuples and w weights; in the worst case, the number of epochs can be exponential in the number of inputs
- For easier comprehension: extract rules from the trained network via network pruning
  - Simplify the network structure by removing the weighted links that have the least effect on the trained network
  - Then perform clustering over links, units, or activation values
  - The sets of input and activation values are used to derive rules describing the relationship between the input and hidden layers
- Sensitivity analysis: assess the impact that a given input variable has on the network output; the knowledge gained can be represented as rules, e.g., "IF X decreases 5% THEN Y increases"
Classification: A Mathematical Mapping

- Classification: predicts categorical class labels
- E.g., personal homepage classification
  - xi = (x1, x2, x3, ...), yi = +1 or -1
  - x1: # of occurrences of the word "homepage"
  - x2: # of occurrences of the word "welcome"
- Mathematically, x ∈ X = ℝⁿ and y ∈ Y = {+1, -1}; we want to derive a function f: X → Y

Linear Classification

- Binary classification problem
- The data above the red line belong to class 'x'; the data below it belong to class 'o'
- Examples: SVM, Perceptron, Probabilistic Classifiers

[Figure: two point clouds, 'x' and 'o', separated by a straight line]
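Such a linear classifier can be illustrated with a tiny perceptron; the 2-D toy data and learning rate below are made up for the sketch and are not from the slide:

```python
import numpy as np

# Toy linearly separable data: class +1 above the line, -1 below (illustrative)
X = np.array([[2.0, 3.0], [3.0, 3.5], [1.0, 0.5], [2.5, 0.2]])
y = np.array([+1, +1, -1, -1])

w, b, lr = np.zeros(2), 0.0, 0.1

# Perceptron rule: on a mistake, nudge the hyperplane toward the example
for _ in range(100):
    for xi, yi in zip(X, y):
        if yi * (w @ xi + b) <= 0:      # misclassified (or on the boundary)
            w += lr * yi * xi
            b += lr * yi

f = lambda x: np.sign(w @ x + b)        # the learned f: X -> {+1, -1}
print(f(np.array([2.0, 3.2])), f(np.array([2.0, 0.1])))  # +1.0 -1.0
```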
SVM — Support Vector Machines

- A relatively new classification method for both linear and nonlinear data
- Uses a nonlinear mapping to transform the original training data into a higher-dimensional space
- Within the new dimensionality, it searches for the linear optimal separating hyperplane (i.e., a "decision boundary")
- With an appropriate nonlinear mapping to a sufficiently high dimension, data from two classes can always be separated by a hyperplane
- The SVM finds this hyperplane using support vectors ("essential" training tuples) and margins (defined by the support vectors)

SVM — History and Applications

- Vapnik and colleagues (1992); the groundwork comes from Vapnik & Chervonenkis' statistical learning theory of the 1960s
- Features: training can be slow, but accuracy is high owing to the ability to model complex nonlinear decision boundaries (margin maximization)
- Used for: classification and numeric prediction
- Applications: handwritten digit recognition, object recognition, speaker identification, benchmark time-series prediction tests

SVM — General Philosophy

[Figure: two separating hyperplanes, one with a small margin and one with a large margin; the support vectors lie on the margin boundaries]

SVM — Margins and Support Vectors

[Figure: the maximum-margin hyperplane with its support vectors highlighted]
SVM — When Data Is Linearly Separable

- Let the data set D be (X1, y1), ..., (X|D|, y|D|), where Xi is a training tuple with class label yi
- There are an infinite number of separating lines (hyperplanes); we want to find the best one, the one that minimizes the classification error on previously unseen data
- SVM searches for the hyperplane with the largest margin, i.e., the maximum marginal hyperplane (MMH)
SVM — Linearly Separable

- A separating hyperplane can be written as W · X + b = 0, where W = {w1, w2, ..., wn} is a weight vector and b a scalar (bias)
- For 2-D data, it can be written as w0 + w1 x1 + w2 x2 = 0
- The hyperplanes defining the sides of the margin:
  H1: w0 + w1 x1 + w2 x2 ≥ 1 for yi = +1, and
  H2: w0 + w1 x1 + w2 x2 ≤ -1 for yi = -1
- Any training tuple that falls on hyperplane H1 or H2 (i.e., on the sides defining the margin) is a support vector
- The maximum margin is 2/‖W‖
- This is a constrained (convex) quadratic optimization problem: a quadratic objective function with linear constraints → quadratic programming (QP), solvable with Lagrangian multipliers (formalized below)
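Putting the pieces above together, the maximum-margin problem can be written as the following standard primal QP, consistent with the margin 2/‖W‖ stated above:

```latex
% Primal maximum-margin formulation for linearly separable data
\min_{\mathbf{W},\, b} \ \frac{1}{2}\lVert \mathbf{W}\rVert^{2}
\quad \text{subject to} \quad
y_i\,(\mathbf{W}\cdot\mathbf{X}_i + b) \ge 1, \qquad i = 1,\dots,|D|.
% Maximizing the margin 2/||W|| is equivalent to minimizing (1/2)||W||^2.
```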
Why Is SVM Effective on High-Dimensional Data?

- The complexity of a trained classifier is characterized by the number of support vectors rather than by the dimensionality of the data
- The support vectors are the essential/critical training tuples: they lie closest to the decision boundary (MMH)
- If all other training tuples were removed and training were repeated, the same separating hyperplane would be found
- The number of support vectors can be used to compute an upper bound on the expected error rate of the SVM classifier, independent of the data dimensionality
- Thus, an SVM with a small number of support vectors can have good generalization, even when the dimensionality of the data is high

SVM — Linearly Inseparable

- Transform the original input data into a higher-dimensional space
- Search for a linear separating hyperplane in the new space

SVM: Different Kernel Functions

- Instead of computing the dot product on the transformed data, it is mathematically equivalent to apply a kernel function K(Xi, Xj) to the original data, i.e., K(Xi, Xj) = Φ(Xi) · Φ(Xj)
- Typical kernel functions include polynomial, Gaussian radial basis function (RBF), and sigmoid kernels (the formulas on the original slide did not survive extraction)
- SVM can also be extended to data with more than two classes and to regression analysis (with additional parameters)
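A hedged sketch of trying these kernels with scikit-learn's SVC (which wraps LIBSVM, mentioned in the links below); the ring-shaped toy data are an illustrative assumption:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy nonlinear problem: class +1 inside a ring, -1 outside (illustrative)
X = rng.normal(size=(200, 2))
y = np.where((X ** 2).sum(axis=1) < 1.0, 1, -1)

for kernel in ("linear", "poly", "rbf", "sigmoid"):
    clf = SVC(kernel=kernel).fit(X, y)
    # n_support_: support vectors per class -- the complexity measure above
    print(kernel, clf.n_support_.sum(), clf.score(X, y))
```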
SVM vs. Neural Network

SVM
- Deterministic algorithm
- Nice generalization properties
- Harder to train: learned in batch mode using quadratic programming techniques
- Using kernels, can learn very complex functions

Neural Network
- Nondeterministic algorithm
- Generalizes well but doesn't have a strong mathematical foundation
- Can easily be learned in incremental fashion
- To learn complex functions, uses a multilayer perceptron (nontrivial)

SVM Related Links

- SVM website: (link lost in extraction)
- Representative implementations:
  - LIBSVM: an efficient implementation of SVM, with multi-class classification, nu-SVM, and one-class SVM, including various interfaces to Java, Python, etc.
  - SVM-light: simpler, but performance is not better than LIBSVM; supports only binary classification, and only in C
  - SVM-torch: another recent implementation, also written in C
Lazy vs. Eager Learning

- Lazy learning (e.g., instance-based learning): simply stores the training data (or performs only minor processing) and waits until it is given a test tuple
- Eager learning (the methods discussed so far): given a set of training tuples, constructs a classification model before receiving new data to classify
- Lazy learning: less time in training but more time in predicting
- Accuracy
  - A lazy learner can effectively use a richer hypothesis space, since it uses many local linear functions to form an implicit global approximation to the target function
  - An eager learner must commit to a single hypothesis that covers the entire instance space

Lazy Learner: Instance-Based Methods

- Instance-based learning: store training examples and delay the processing ("lazy evaluation") until a new instance must be classified
- Typical approaches:
  - k-nearest neighbor approach: instances represented as points in a Euclidean space
  - Locally weighted regression: constructs a local approximation
  - Case-based reasoning: uses symbolic representations and knowledge-based inference
The k-Nearest Neighbor Algorithm

- All instances correspond to points in n-D space
- The nearest neighbors are defined in terms of Euclidean distance, dist(X1, X2)
- The target function can be discrete- or real-valued
- For discrete-valued targets, k-NN returns the most common class among the k training examples nearest to the target tuple
- Voronoi diagram: the decision surface induced by 1-NN for a typical set of training examples

[Figure: a query point xq among '+' and '-' training points, with the induced decision surface]

Discussion on the k-NN Algorithm

- For prediction of an unknown real value, k-NN returns the mean of the corresponding values of the k nearest neighbors of the unknown tuple
- Distance-weighted nearest neighbor algorithm: weight the contribution of each of the k neighbors according to its distance to the query tuple, giving greater weight to closer neighbors (see the sketch below)
- Robust to noisy data, by averaging over the k nearest neighbors
- Curse of dimensionality: distances between neighbors can be dominated by irrelevant attributes; remedies include stretching the axes or eliminating the least relevant attributes
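A small numpy sketch of distance-weighted k-NN classification; the 1/d² weighting scheme and the toy data are illustrative assumptions:

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3):
    """Distance-weighted k-NN: closer neighbors get larger votes (weight 1/d^2)."""
    d = np.linalg.norm(X_train - x, axis=1)          # Euclidean distances
    idx = np.argsort(d)[:k]                          # the k nearest neighbors
    w = 1.0 / (d[idx] ** 2 + 1e-12)                  # avoid division by zero
    votes = {}
    for label, weight in zip(y_train[idx], w):
        votes[label] = votes.get(label, 0.0) + weight
    return max(votes, key=votes.get)                 # weighted majority vote

X_train = np.array([[0, 0], [0, 1], [1, 0], [5, 5], [5, 6], [6, 5]])
y_train = np.array(["o", "o", "o", "x", "x", "x"])
print(knn_predict(X_train, y_train, np.array([4.5, 5.0])))  # 'x'
```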
Case-Based Reasoning (CBR)

- CBR: uses a database of problem solutions to solve new problems
- Stores symbolic descriptions (tuples or cases), not points in a Euclidean space
- Applications: customer service help desks (product-related diagnosis), legal rulings
- Methodology
  - Instances are represented by rich symbolic descriptions (e.g., function graphs)
  - Search for similar cases; multiple retrieved cases may be combined
  - Tight coupling between case retrieval, knowledge-based reasoning, and problem solving
- Challenges
  - Finding a good similarity metric
  - Indexing based on syntactic similarity measures and, when that fails, backtracking and adapting to additional cases
Genetic Algorithms (GA)

- Genetic algorithm: based on an analogy to biological evolution
- An initial population is created, consisting of randomly generated rules
- Each rule is represented by a string of bits
  - E.g., the rule "IF A1 AND NOT A2 THEN C2" can be encoded as "100", where the leftmost two bits represent attributes A1 and A2 and the rightmost bit the class
  - If an attribute has k > 2 values, k bits are used
- Based on the notion of survival of the fittest, a new population is formed from the fittest rules and their offspring
- The fitness of a rule is assessed by its classification accuracy on a set of training examples
- Offspring are generated by crossover and mutation
- The process continues until a population P evolves in which each rule satisfies a given fitness threshold
- Slow, but easily parallelizable

Fuzzy Set Approaches

- Fuzzy logic uses truth values in [0.0, 1.0] to represent the degree of membership in a class
- Attribute values are converted to fuzzy values. E.g., for the discrete income categories {low, medium, high}, x is assigned a fuzzy membership in each category: an income of $49K might belong to "medium income" with membership 0.15 and to "high income" with membership 0.96
- Fuzzy membership values do not have to sum to 1
- Each applicable rule contributes a vote for membership in the categories
- Typically, the truth values for each predicted category are summed, and these sums are combined
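A minimal sketch of such fuzzy membership functions; the piecewise-linear shapes and breakpoints are made-up assumptions, chosen so that $49K reproduces the 0.15/0.96 memberships in the example above:

```python
import numpy as np

def ramp_down(x, lo, hi):
    """Membership 1 below lo, 0 above hi, linear in between."""
    return float(np.clip((hi - x) / (hi - lo), 0.0, 1.0))

def ramp_up(x, lo, hi):
    """Membership 0 below lo, 1 above hi, linear in between."""
    return float(np.clip((x - lo) / (hi - lo), 0.0, 1.0))

income = 49  # $49K
medium = ramp_down(income, 32, 52)  # -> 0.15
high   = ramp_up(income, 25, 50)    # -> 0.96
print(medium, high)  # the memberships need not sum to 1
```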
Multiclass Classification

- Classification involving more than two classes (i.e., > 2 classes)
- Method 1. One-vs.-all (OVA): learn one classifier at a time
  - Given m classes, train m classifiers, one for each class
  - Classifier j: treat the tuples in class j as positive and all others as negative
  - To classify a tuple X, the set of classifiers vote as an ensemble (see the sketch below)
- Method 2. All-vs.-all (AVA): learn a classifier for each pair of classes
  - Given m classes, construct m(m-1)/2 binary classifiers
  - Each classifier is trained using the tuples of the two classes it separates
  - To classify a tuple X, each classifier votes; X is assigned to the class with the maximal vote
- Comparison: all-vs.-all tends to be superior to one-vs.-all
- Problem: a binary classifier is sensitive to errors, and errors affect the vote count
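A compact sketch of one-vs.-all, using scikit-learn logistic regressions as the m binary classifiers; the base learner and the toy three-class data are illustrative choices, not prescribed by the slide:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
centers = np.array([[0, 0], [4, 0], [0, 4]])
X = rng.normal(size=(90, 2)) + np.repeat(centers, 30, axis=0)
y = np.repeat([0, 1, 2], 30)   # m = 3 classes

# Train one binary classifier per class: class j positive, all others negative
classifiers = {j: LogisticRegression().fit(X, (y == j).astype(int))
               for j in np.unique(y)}

def predict_ova(x):
    # Each classifier scores the tuple; pick the class with the strongest vote
    scores = {j: clf.predict_proba(x.reshape(1, -1))[0, 1]
              for j, clf in classifiers.items()}
    return max(scores, key=scores.get)

print(predict_ova(np.array([3.8, 0.2])))  # -> 1
```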
Error-Correcting Codes for Multiclass Classification

- Originally designed to correct errors during data transmission in communication tasks by exploiting data redundancy
- Example: a 7-bit codeword associated with classes 1-4:

  Class   Error-Corr. Codeword
  C1      1 1 1 1 1 1 1
  C2      0 0 0 0 1 1 1
  C3      0 0 1 1 0 0 1
  C4      0 1 0 1 0 1 0

- Given an unknown tuple X, suppose the results of the 7 trained classifiers are 0001010
- Hamming distance: the number of bit positions in which two codewords differ
  - H(X, C1) = 5, checking the differing bits between 1111111 and 0001010
  - H(X, C2) = 3, H(X, C3) = 3, H(X, C4) = 1; thus C4 is chosen as the label for X
- Error-correcting codes can correct up to ⌊(h - 1)/2⌋ single-bit errors, where h is the minimum Hamming distance between any two codewords
- If we use 1 bit per class, this is equivalent to the one-vs.-all approach, and the codes are insufficient to self-correct
- When selecting error-correcting codes, there should be good row-wise and column-wise separation between the codewords
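The decoding step is easy to sketch; the codeword table and classifier outputs below are taken directly from the example:

```python
codewords = {
    "C1": "1111111",
    "C2": "0000111",
    "C3": "0011001",
    "C4": "0101010",
}

def hamming(a: str, b: str) -> int:
    """Number of bit positions in which two codewords differ."""
    return sum(x != y for x, y in zip(a, b))

output = "0001010"  # outputs of the 7 binary classifiers for tuple X
dists = {c: hamming(output, w) for c, w in codewords.items()}
print(dists)                      # {'C1': 5, 'C2': 3, 'C3': 3, 'C4': 1}
print(min(dists, key=dists.get))  # 'C4' -- the predicted class
```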
Semi-Supervised Classification

- Semi-supervised: uses both labeled and unlabeled data to build a classifier
- Self-training (sketched below):
  - Build a classifier using the labeled data
  - Use it to label the unlabeled data; the tuples with the most confident label predictions are added to the set of labeled data
  - Repeat the above process
  - Advantage: easy to understand; disadvantage: may reinforce errors
- Co-training: use two or more classifiers to teach each other
  - Each learner uses a mutually independent set of features of each tuple to train a good classifier, say f1 and f2
  - Then f1 and f2 are used to predict the class labels of unlabeled tuples X
  - Teach each other: the tuple having the most confident prediction from f1 is added to the set of labeled data for f2, and vice versa
- Other methods exist, e.g., those based on a joint probability distribution of features and labels
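A minimal sketch of the self-training loop described above; the confidence threshold, the base learner, and the stopping rule are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(X_lab, y_lab, X_unlab, threshold=0.95, max_rounds=10):
    """Self-training: repeatedly promote the most confident predictions to labels."""
    clf = LogisticRegression().fit(X_lab, y_lab)
    for _ in range(max_rounds):
        if len(X_unlab) == 0:
            break
        proba = clf.predict_proba(X_unlab)
        confident = proba.max(axis=1) >= threshold     # most confident predictions
        if not confident.any():
            break
        pseudo = clf.classes_[proba[confident].argmax(axis=1)]
        X_lab = np.vstack([X_lab, X_unlab[confident]])
        y_lab = np.concatenate([y_lab, pseudo])
        X_unlab = X_unlab[~confident]
        clf = LogisticRegression().fit(X_lab, y_lab)   # retrain on the grown set
    return clf
```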
Active Learning

- Class labels are expensive to obtain
- Active learner: query a human (an oracle) for labels
- Pool-based approach: uses a pool of unlabeled data
  - L: a subset of D consisting of labeled examples; U: a set of unlabeled data in D
  - A query function carefully selects one or more tuples from U and queries an oracle (a human annotator) for their labels
  - The newly labeled samples are added to L, which is then used to learn a model
- Goal: achieve high accuracy using as few labeled tuples as possible
- Evaluated using learning curves: accuracy as a function of the number of instances queried (the number of tuples to be queried should be small)
- Research issue: how to choose the data tuples to be queried?
  - Uncertainty sampling: choose the tuples the current model is least certain about (sketched below)
  - Reduce the version space, the subset of hypotheses consistent with the training data
  - Reduce the expected entropy over U: find the greatest reduction in the total number of incorrect predictions
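A small sketch of uncertainty sampling for the pool-based setting; the max-probability uncertainty measure and the batch size are illustrative choices:

```python
import numpy as np

def query_most_uncertain(clf, X_pool, n_queries=1):
    """Pick the pool tuples whose top-class probability is lowest (least certain)."""
    proba = clf.predict_proba(X_pool)      # assumes a fitted probabilistic classifier
    uncertainty = 1.0 - proba.max(axis=1)  # low max-probability = high uncertainty
    return np.argsort(uncertainty)[-n_queries:]  # indices to send to the oracle
```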
Transfer Learning: A Conceptual Framework

- Transfer learning: extract knowledge from one or more source tasks and apply that knowledge to a target task
- Traditional learning: build a new classifier for every task
- Transfer learning: build the new classifier by applying existing knowledge learned from the source tasks

[Figure: the traditional learning framework (one model trained per task) vs. the transfer learning framework (source-task knowledge reused for the target task)]

Transfer Learning: Methods and Applications

- Applications: useful when the data become outdated or the distribution changes, e.g., web document classification, e-mail spam filtering
- Instance-based transfer learning: reweight some of the data from the source tasks and use them to learn the target task
- TrAdaBoost (Transfer AdaBoost)
  - Assumes the source and target data are described by the same attributes and classes, but follow different distributions
  - Requires labeling only a small amount of the target data
  - Uses the source data in training: when a source tuple is misclassified, its weight is reduced so that it has less effect on the subsequently built classifier
- Research issues
  - Negative transfer: when transfer learning performs worse than no transfer at all
  - Heterogeneous transfer learning: transferring knowledge from a different feature space or from multiple source domains
  - Large-scale transfer learning
Associative Classification

- Associative classification: major steps
  - Mine strong associations between frequent patterns (conjunctions of attribute-value pairs) and class labels
  - Generate association rules of the form p1 ∧ p2 ∧ ... ∧ pl → "Aclass = C" (conf, sup)
  - Organize the rules to form a rule-based classifier
- Why effective? It explores highly confident associations among multiple attributes and may overcome some constraints introduced by decision-tree induction, which considers only one attribute at a time
- Studies have found associative classification to often be more accurate than some traditional classification methods, such as C4.5

Typical Associative Classification Methods

- CBA (Classification Based on Associations: Liu, Hsu & Ma, KDD'98)
  - Mine possible association rules of the form: cond-set (a set of attribute-value pairs) → class label
  - Build the classifier: organize the rules in descending order of confidence and support (a first-match sketch follows below)
- CMAR (Classification based on Multiple Association Rules: Li, Han, Pei, ICDM'01)
  - Classification: statistical analysis over multiple rules
- CPAR (Classification based on Predictive Association Rules: Yin & Han, SDM'03)
  - Generates predictive rules (FOIL-like analysis), but allows covered tuples to remain with reduced weight when constructing new rules (based on expected accuracy)
  - Uses the best k rules for prediction
  - More efficient (generates fewer rules), with accuracy similar to CMAR
Frequent Patterns vs. Single Features

- Some frequent patterns have more discriminative power than single features

[Fig. 1: Information gain vs. pattern length on the (a) Austral, (b) Cleve, and (c) Sonar data sets]

Empirical Results

[Fig. 2: Information gain vs. pattern frequency on the (a) Austral, (b) Breast, and (c) Sonar data sets]

Feature Selection

- Given the set of frequent patterns, there exist non-discriminative and redundant patterns, which can cause overfitting
- We want to select the discriminative patterns and remove the redundant ones
- The selection borrows the concept of Maximal Marginal Relevance (MMR): a document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents

Experimental Results

[Results slide: the tables and plots did not survive extraction]