Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4 of Introduction to Data Mining, by Tan, Steinbach, Kumar

Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task (figure slide)

Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory-based reasoning
- Neural networks
- Naive Bayes and Bayesian belief networks
- Support vector machines
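The training/test methodology above can be sketched in a few lines of plain Python. The records and the hand-written stand-in "model" below are illustrative inventions, not something the notes specify; a learned model would replace the hard-coded rules.

```python
# Minimal sketch of the train/test methodology: build a model on one
# portion of the data, measure accuracy on a held-out portion.
# The records and the toy model are invented for illustration only.

records = [
    # (refund, marital_status, taxable_income_k, cheat)
    ("Yes", "Single",   125, "No"),
    ("No",  "Married",  100, "No"),
    ("No",  "Single",    70, "No"),
    ("Yes", "Married",  120, "No"),
    ("No",  "Divorced",  95, "Yes"),
    ("No",  "Married",   60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No",  "Single",    85, "Yes"),
    ("No",  "Married",   75, "No"),
    ("No",  "Single",    90, "Yes"),
]

# Split: the first 7 records build the model, the last 3 validate it.
train, test = records[:7], records[7:]

def predict(record):
    """A hand-written stand-in for a learned model:
    class = f(other attributes)."""
    refund, marital, income, _ = record
    if refund == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return "Yes" if income >= 80 else "No"

# Accuracy on the held-out test set estimates how well the model
# will label previously unseen records.
correct = sum(predict(r) == r[-1] for r in test)
accuracy = correct / len(test)
```

On real data the split would be randomized; the fixed slice here just keeps the sketch deterministic.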
Example of a Decision Tree
- Training data: records with categorical attributes Refund and MarSt (marital status), a continuous attribute TaxInc (taxable income), and the class attribute Cheat.
- Model: a decision tree. Splitting attributes: Refund (Yes / No); for Refund = No, MarSt (Single, Divorced / Married); for Single or Divorced, TaxInc (< 80K / >= 80K). Leaves are labeled YES or NO.

Another Example of a Decision Tree
- A different tree over the same attributes, splitting first on MarSt, then on Refund and TaxInc.
- There could be more than one tree that fits the same data!

Apply Model to Test Data
- Start from the root of the tree, and route the test record down the branch matching its attribute value at each internal node until a leaf is reached.
- For the test record with Refund = No and MarSt = Married: assign Cheat to "No".

Decision Tree Induction
- Many algorithms: Hunt's algorithm (one of the earliest), CART, ID3, C4.5, SLIQ, SPRINT.

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- General procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  - If Dt is an empty set, then t is a leaf node labeled by the default class yd.
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
- (Worked-example slide: the tree is grown step by step on the training data, starting from a single leaf labeled Don't Cheat.)

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records: how to specify the attribute test condition, and how to determine the best split.
  - Determine when to stop splitting.
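The general procedure above translates directly into a recursion. This is a sketch under simplifying assumptions: the tests are supplied in a fixed order rather than chosen by an impurity criterion (split selection is covered later), and a majority-class leaf is added when tests run out, which the notes do not spell out.

```python
from collections import Counter

def hunt(records, get_label, candidate_tests, default_class):
    """Recursive sketch of Hunt's algorithm.

    records:         training records reaching this node (Dt)
    get_label:       function mapping a record to its class
    candidate_tests: list of (name, predicate) attribute tests, tried
                     in order (best-split selection is abstracted away)
    default_class:   label for an empty Dt (yd)
    """
    # Case 1: Dt is empty -> leaf labeled with the default class yd.
    if not records:
        return {"leaf": default_class}

    labels = {get_label(r) for r in records}
    # Case 2: all records belong to the same class yt -> leaf labeled yt.
    if len(labels) == 1:
        return {"leaf": labels.pop()}

    # Case 3 (not in the notes): mixed classes but no tests left ->
    # fall back to a majority-class leaf to guarantee termination.
    if not candidate_tests:
        majority = Counter(get_label(r) for r in records).most_common(1)[0][0]
        return {"leaf": majority}

    # Case 4: split on an attribute test and recurse on each subset.
    name, test = candidate_tests[0]
    yes = [r for r in records if test(r)]
    no = [r for r in records if not test(r)]
    rest = candidate_tests[1:]
    return {
        "test": name,
        "yes": hunt(yes, get_label, rest, default_class),
        "no": hunt(no, get_label, rest, default_class),
    }
```

A real induction algorithm would pick the best remaining test at each node instead of consuming the list in order.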
How to Specify the Test Condition?
- Depends on the attribute type: nominal, ordinal, or continuous.
- Depends on the number of ways to split: 2-way split or multi-way split.

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.
- What about a split that groups non-adjacent values, such as {Small, Large} versus {Medium}? It violates the order of the values, so it is not a valid ordinal split.

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute. Static: discretize once at the beginning. Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  - Binary decision: (A < v) or (A >= v). Consider all possible splits and find the best cut; this can be more compute intensive.
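The two handling strategies above can be sketched concisely. The midpoint rule for candidate cuts is a common convention the notes do not spell out, and the bucketing helper assumes the values are not all identical.

```python
def candidate_cuts(values):
    """Candidate thresholds v for a binary decision (A < v) vs (A >= v):
    midpoints between consecutive distinct sorted values, so every
    distinct placement of the cut is considered exactly once."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]

def equal_width_bins(values, k):
    """Static discretization: map each value to one of k equal-interval
    buckets over the observed range (assumes min(values) < max(values))."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / k
    # Bucket index in [0, k-1]; the maximum value falls in the last bucket.
    return [min(int((v - lo) / width), k - 1) for v in values]
```

Equal-frequency bucketing would instead place roughly len(values)/k values in each bucket by cutting at percentiles.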
How to Determine the Best Split
- Before splitting: 10 records of class 0 and 10 records of class 1. Which test condition is the best?
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity: a non-homogeneous node has a high degree of impurity; a homogeneous node has a low degree of impurity.

Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error

How to Find the Best Split
- Compute the impurity M0 of the node before splitting.
- For each candidate test (A?, yielding children N1 and N2, versus B?, yielding children N3 and N4), compute the weighted impurity of the children: M12 for A, M34 for B.
- Compare Gain = M0 - M12 against M0 - M34 and choose the split with the larger gain.

Measure of Impurity: GINI
- Gini index for a given node t:
  GINI(t) = 1 - sum_j [p(j | t)]^2
  where p(j | t) is the relative frequency of class j at node t.
- Maximum value (1 - 1/nc, for nc classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum value (0.0) when all records belong to one class, implying the most interesting information.

Examples of Computing GINI
- Counts C1 = 0, C2 = 6: P(C1) = 0/6 = 0, P(C2) = 6/6 = 1; Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0.
- Counts C1 = 1, C2 = 5: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278.
- Counts C1 = 2, C2 = 4: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444.
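The worked examples above follow directly from the definition; a small helper reproduces them from raw class counts.

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from the class counts
    at node t; an empty node is treated as pure (impurity 0)."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)
```

The three node distributions from the examples give 0, 0.278, and 0.444 respectively, confirming that impurity grows as the class distribution approaches uniform.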
Splitting Based on GINI
- Used in CART, SLIQ, and SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as
  GINI_split = sum_{i=1..k} (n_i / n) * GINI(i)
  where n_i = number of records at child i and n = number of records at node p.

Binary Attributes: Computing the GINI Index
- The split creates two partitions. Effect of weighting partitions: larger and purer partitions are sought.
- Example: splitting on B? sends 5 records of C1 and 2 of C2 to node N1 (7 records), and 1 record of C1 and 4 of C2 to node N2 (5 records):
  Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
  Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
  Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371

Categorical Attributes: Computing the Gini Index
- For each distinct value, gather counts for each class in the dataset.
- Use the count matrix to make decisions: either a multi-way split, or a two-way split (find the best partition of the values).

Continuous Attributes: Computing the Gini Index
- Use binary decisions based on one value; the number of possible splitting values equals the number of distinct values.
- Each splitting value v has a count matrix associated with it: the class counts in each of the two partitions, A < v and A >= v.
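The weighted-split formula and the binary-attribute example above can be checked mechanically; the helper below takes one class-count list per child.

```python
def gini(counts):
    """Node impurity from class counts: 1 - sum_j p(j)^2."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(children):
    """GINI_split = sum_i (n_i / n) * GINI(child i), where children is
    a list of per-child class-count lists and n is the parent size."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

# Binary split example from the notes:
# N1 holds 5 of C1 and 2 of C2; N2 holds 1 of C1 and 4 of C2.
quality = gini_split([[5, 2], [1, 4]])
```

Lower GINI_split is better, so a split search would evaluate this quantity for every candidate test and keep the minimum.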
ROC Curve
- Example: instances are scored, and an instance is classified as positive if its score (e.g. sqrt(x1^2 + x2^2)) exceeds a threshold t.
- Points (TPR, FPR): (0,0) declares everything to be the negative class; (1,1) declares everything to be the positive class; (1,0) is ideal.
- The diagonal line corresponds to random guessing; below the diagonal, the prediction is the opposite of the true class.

Using ROC for Model Comparison
- Neither model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
- Area under the ROC curve (AUC): the ideal model has area = 1; random guessing has area = 0.5.

How to Construct an ROC Curve
- Use a classifier that produces a posterior probability P(+|A) for each test instance A.
- Sort the instances according to P(+|A) in decreasing order.
- Apply a threshold at each unique value of P(+|A).
- At each threshold, count the number of TP, FP, TN, FN.
- TP rate: TPR = TP / (TP + FN). FP rate: FPR = FP / (FP + TN).

Test of Significance
- Given two models: Model M1 with accuracy = 85%, tested on 30 instances; Model M2 with accuracy = 75%, tested on 5000 instances.
- Can we say M1 is better than M2? How much confidence can we place in the accuracies of M1 and M2? Can the difference in performance be explained as a result of random fluctuations in the test set?
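The construction steps above can be sketched directly: sort by posterior, sweep a threshold over the unique scores, and record (FPR, TPR) at each step. The sketch assumes both classes occur in the test set.

```python
def roc_points(scores_and_labels):
    """ROC points from posterior probabilities P(+|A).

    scores_and_labels: list of (P(+|A), label) pairs with labels '+'/'-'.
    Returns (FPR, TPR) points from (0, 0) to (1, 1), one per unique
    threshold value, following the sort-and-sweep recipe above.
    """
    ranked = sorted(scores_and_labels, key=lambda s: s[0], reverse=True)
    pos = sum(1 for _, y in ranked if y == '+')
    neg = len(ranked) - pos  # assumes pos > 0 and neg > 0

    points = [(0.0, 0.0)]  # threshold above every score: all negative
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == '+':
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last instance of each unique score,
        # i.e. one point per distinct threshold value of P(+|A).
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / neg, tp / pos))
    return points
```

The area under the resulting polyline is the AUC; 1.0 for a perfect ranking, 0.5 in expectation for random scores.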
Confidence Interval for Accuracy
- A prediction can be regarded as a Bernoulli trial: it has 2 possible outcomes, correct or wrong.
- A collection of Bernoulli trials has a binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions.
- Example: toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N * p = 50 * 0.5 = 25.
- Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p (the true accuracy of the model)?
- For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1-p)/N, so
  P( -Z_{alpha/2} <= (acc - p) / sqrt(p(1-p)/N) <= Z_{1-alpha/2} ) = 1 - alpha.
- Solving for p gives the confidence interval
  p = ( 2*N*acc + Z^2 +/- Z * sqrt(Z^2 + 4*N*acc - 4*N*acc^2) ) / ( 2*(N + Z^2) ).
- Example: consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8. Let 1 - alpha = 0.95 (95% confidence); from the probability table, Z_{alpha/2} = 1.96.

Comparing the Performance of 2 Models
- Given two models, say M1 and M2.
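Plugging the worked example into the interval formula above gives roughly p in [0.711, 0.867]; a short helper makes the arithmetic explicit.

```python
import math

def accuracy_interval(acc, n, z=1.96):
    """Confidence interval for the true accuracy p, per the formula above:
    p = (2*N*acc + Z^2 +/- Z*sqrt(Z^2 + 4*N*acc - 4*N*acc^2)) / (2*(N + Z^2)).
    The default z = 1.96 corresponds to 95% confidence."""
    root = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return ((2 * n * acc + z * z - root) / denom,
            (2 * n * acc + z * z + root) / denom)

# Slide example: 80% observed accuracy on 100 test instances.
lo, hi = accuracy_interval(0.8, 100)
```

Note the interval is not symmetric around acc = 0.8; the formula pulls the estimate slightly toward 0.5, and the interval tightens as N grows.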