Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Introduction to Data Mining, by Tan, Steinbach, Kumar

Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
- A test set is used to determine the accuracy of the model. Usually the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.

Illustrating Classification Task
(figure: a learning algorithm induces a model from the training set, and the model is then applied to the test set)

Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory based reasoning
- Neural Networks
- Naive Bayes and Bayesian Belief Networks
- Support Vector Machines

Example of a Decision Tree
(figure: training data with categorical attributes Refund and MarSt, continuous attribute TaxInc, and the class attribute Cheat; the model is a decision tree whose internal nodes are the splitting attributes:
  Refund = Yes -> NO
  Refund = No, MarSt = Married -> NO
  Refund = No, MarSt = Single or Divorced, TaxInc < 80K -> NO
  Refund = No, MarSt = Single or Divorced, TaxInc >= 80K -> YES)

Another Example of Decision Tree
(figure: a different tree over the same training data, splitting on MarSt first, then Refund, then TaxInc)
There could be more than one tree that fits the same data!

Decision Tree Classification Task
(figure: induction learns a decision tree from the training set; deduction applies it to the test set)

Apply Model to Test Data
- Start from the root of the tree.
- At each internal node, apply the attribute test (Refund, then MarSt, then TaxInc) and follow the branch that matches the test record, until a leaf is reached.
- (sequence of figures walking one test record down the Refund / MarSt / TaxInc tree)
- The label of the leaf is the prediction: assign Cheat to "No" for this test record.

Decision Tree Classification Task
(figure repeated, with the learned decision tree used in the deduction step)
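To make the root-to-leaf walk concrete, here is a minimal sketch of applying a decision tree to a record. It is not code from the slides: the dict-based tree encoding, the predict helper, and the example record are illustrative assumptions that mirror the Refund / MarSt / TaxInc model.

```python
# Minimal sketch: applying a decision tree to a test record.
# The dict-based tree encoding and the example record are illustrative assumptions.

def predict(node, record):
    """Route a record from the root to a leaf and return the leaf label."""
    while isinstance(node, dict):               # internal node: attribute test + children
        value = record[node["attr"]]
        node = node["children"][node["branch"](value)]
    return node                                  # leaf: a class label

# Example tree mirroring the slides' Refund / MarSt / TaxInc model.
tree = {
    "attr": "Refund", "branch": lambda v: v, "children": {
        "Yes": "No",                             # Refund = Yes -> Cheat = No
        "No": {
            "attr": "MarSt",
            "branch": lambda v: "Married" if v == "Married" else "Single,Divorced",
            "children": {
                "Married": "No",
                "Single,Divorced": {
                    "attr": "TaxInc",
                    "branch": lambda v: "<80K" if v < 80 else ">=80K",
                    "children": {"<80K": "No", ">=80K": "Yes"},
                },
            },
        },
    },
}

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}   # hypothetical test record
print(predict(tree, record))                                   # -> "No"
```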

Decision Tree Induction
- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- General procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  - If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, then recursively apply the procedure to each subset (a minimal code sketch of this recursion follows below).

Hunt's Algorithm
(figure: the procedure applied step by step to the training data, starting from a single node labeled "Don't Cheat" and refining it split by split)
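A minimal sketch of that recursive procedure, under stated assumptions: records are (attribute-dict, class-label) pairs, and choose_test is a placeholder for whatever attribute-test selection is used (for example, the best Gini split discussed later); neither name comes from the slides.

```python
# Minimal sketch of Hunt's recursive tree-growing procedure (illustrative).
# `records` is a list of (attribute_dict, class_label) pairs.
# `choose_test(records)` returns a function mapping a record's attributes to a
# branch label, or None when no further split is possible.
from collections import Counter

def hunt(records, choose_test, default_class):
    if not records:                                   # D_t is empty -> leaf labeled with default class y_d
        return default_class
    classes = {c for _, c in records}
    if len(classes) == 1:                             # all of D_t in one class y_t -> leaf labeled y_t
        return classes.pop()
    test = choose_test(records)
    majority = Counter(c for _, c in records).most_common(1)[0][0]
    if test is None:                                  # cannot split further -> majority-class leaf
        return majority
    # split D_t into smaller subsets according to the test outcome ...
    branches = {}
    for attrs, label in records:
        branches.setdefault(test(attrs), []).append((attrs, label))
    # ... and recursively apply the procedure to each subset
    return {"test": test,
            "children": {o: hunt(sub, choose_test, majority) for o, sub in branches.items()}}
```

The returned structure is either a class label (a leaf) or a node holding the chosen test and one subtree per test outcome.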

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records:
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting.

How to Specify Test Condition?
- Depends on the attribute type: nominal, ordinal, continuous.
- Depends on the number of ways to split: 2-way split, multi-way split.

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divides the values into two subsets; need to find the optimal partitioning.

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
- What about this split? (figure: a binary split whose two subsets do not respect the ordering of the values)

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute:
    - Static: discretize once at the beginning.
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  - Binary decision: (A < v) or (A >= v):
    - consider all possible splits and find the best cut;
    - can be more compute intensive.

Splitting Based on Continuous Attributes
(figure: a binary split and a multi-way split on a continuous attribute)

Tree Induction
- (outline repeated) Greedy strategy; issues: how to specify the attribute test condition, how to determine the best split, when to stop splitting.

How to determine the Best Split
- Before splitting: 10 records of class 0 and 10 records of class 1.
- Which test condition is the best? (figure: several candidate attribute tests on the same 20 records)
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity:
  - non-homogeneous: high degree of impurity;
  - homogeneous: low degree of impurity.

Measures of Node Impurity
- Gini Index
- Entropy
- Misclassification error

How to Find the Best Split
- (figure) Before splitting, the impurity of the parent node is M0. Splitting on A? (Yes/No) produces nodes N1 and N2 with combined impurity M12; splitting on B? (Yes/No) produces nodes N3 and N4 with combined impurity M34.
- Compare Gain = M0 - M12 vs. M0 - M34 and choose the split with the larger gain.

Measure of Impurity: GINI
- Gini index for a given node t:
  GINI(t) = 1 - sum_j [ p(j | t) ]^2
  (NOTE: p(j | t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/nc, where nc is the number of classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for computing GINI
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- P(C1) = 1/6, P(C2) = 5/6: Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- P(C1) = 2/6, P(C2) = 4/6: Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
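A minimal sketch of this computation (the gini helper name and the class-count interface are assumptions, not from the slides); it reproduces the three examples above:

```python
# Minimal sketch: Gini index of a node from its class counts,
# GINI(t) = 1 - sum_j p(j|t)^2.

def gini(counts):
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

# The three examples from the slide (two classes C1, C2 with 6 records per node):
print(round(gini([0, 6]), 3))   # 0.0
print(round(gini([1, 5]), 3))   # 0.278
print(round(gini([2, 4]), 3))   # 0.444
```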

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as
  GINI_split = sum_{i=1..k} (n_i / n) * GINI(i),
  where n_i = number of records at child i and n = number of records at node p.

Binary Attributes: Computing GINI Index
- Splits into two partitions.
- Effect of weighing partitions: larger and purer partitions are sought for.
- (example) A binary split on B? gives node N1 with class counts (5, 2) and node N2 with class counts (1, 4):
  Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
  Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
  Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
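Reusing the gini helper sketched above, the weighted split quality can be computed as follows; the gini_split name is an assumption, and the counts are those of the example just shown:

```python
# Minimal sketch: quality of a split as the weighted Gini of its children,
# GINI_split = sum_i (n_i / n) * GINI(child_i).  Reuses gini() from the sketch above.

def gini_split(children_counts):
    n = sum(sum(c) for c in children_counts)
    return sum(sum(c) / n * gini(c) for c in children_counts)

# The binary split B? from the slide: N1 has counts (5, 2), N2 has counts (1, 4).
print(round(gini_split([(5, 2), (1, 4)]), 3))   # 0.371
```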

Categorical Attributes: Computing Gini Index
- For each distinct value, gather the counts for each class in the dataset.
- Use the count matrix to make decisions.
- Multi-way split, or two-way split (find the best partition of values).

Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value.
- Several choices for the splitting value: the number of possible splitting values = the number of distinct values.
- Each splitting value v has a count matrix associated with it: class counts in each of the two partitions, A < v and A >= v (a code sketch of this search follows below).
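A minimal brute-force sketch of that search, under assumptions: the best_continuous_split name, the node_gini helper (repeating the Gini helper above so the snippet is self-contained), and the taxable-income data are illustrative, not taken from this preview. It evaluates the weighted Gini at the midpoint between each pair of consecutive distinct values and keeps the best cut.

```python
# Minimal sketch: find the best binary split (A < v) vs (A >= v) on a continuous
# attribute by evaluating the weighted Gini at each candidate cut point.
from collections import Counter

def node_gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def best_continuous_split(values, labels):
    classes = sorted(set(labels))
    distinct = sorted(set(values))
    candidates = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]  # midpoints
    best = (None, float("inf"))
    n = len(values)
    for v in candidates:
        left = Counter(l for x, l in zip(values, labels) if x < v)
        right = Counter(l for x, l in zip(values, labels) if x >= v)
        weighted = sum(sum(side.values()) / n * node_gini([side[c] for c in classes])
                       for side in (left, right))
        if weighted < best[1]:
            best = (v, weighted)
    return best

# Hypothetical taxable-income data (in K) with a Cheat label:
income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_continuous_split(income, cheat))   # -> (97.5, 0.3)
```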

[…]

ROC Curve
- (example decision rule) an instance is classified as positive when its score exceeds a threshold t, e.g. sqrt(x1^2 + x2^2) > t.
- Points (TP rate, FP rate) on the curve:
  - (0,0): declare everything to be the negative class;
  - (1,1): declare everything to be the positive class;
  - (1,0): ideal.
- Diagonal line: random guessing.
- Below the diagonal line: the prediction is the opposite of the true class.

Using ROC for Model Comparison
- No model consistently outperforms the other: M1 is better for small FPR, M2 is better for large FPR.
- Area Under the ROC curve: ideal: area = 1; random guess: area = 0.5.

How to Construct an ROC curve
- Use a classifier that produces a posterior probability P(+|A) for each test instance A.
- Sort the instances according to P(+|A) in decreasing order.
- Apply a threshold at each unique value of P(+|A).
- Count the number of TP, FP, TN, FN at each threshold.
- TP rate: TPR = TP / (TP + FN)
- FP rate: FPR = FP / (FP + TN)
(A code sketch of this construction follows below.)

How to construct an ROC curve
(figure: a worked example listing each threshold value with its TP, FP, TN, FN counts and the resulting ROC curve)
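A minimal sketch of this construction (the roc_points helper and the toy scores/labels are assumptions): sort by P(+|A), sweep the threshold over each unique score, and record one (FPR, TPR) point per threshold.

```python
# Minimal sketch: build ROC points from posterior probabilities P(+|A).
# Sort by score (descending), threshold at each unique score, count TP/FP/TN/FN.

def roc_points(scores, labels, positive="+"):
    pairs = sorted(zip(scores, labels), reverse=True)        # decreasing P(+|A)
    P = sum(1 for l in labels if l == positive)               # total positives
    N = len(labels) - P                                       # total negatives
    points = [(0.0, 0.0)]                                     # threshold above every score
    tp = fp = 0
    for i, (s, l) in enumerate(pairs):
        if l == positive:
            tp += 1
        else:
            fp += 1
        # emit a point only once all ties at this score have been processed
        if i + 1 == len(pairs) or pairs[i + 1][0] != s:
            points.append((fp / N, tp / P))                   # (FPR, TPR)
    return points

# Hypothetical posterior probabilities and true classes for 10 test instances:
scores = [0.95, 0.93, 0.87, 0.85, 0.85, 0.85, 0.76, 0.53, 0.43, 0.25]
labels = ["+", "+", "-", "-", "-", "+", "-", "+", "-", "+"]
print(roc_points(scores, labels))
```

The list starts at (0,0) and ends at (1,1); connecting the points traces the ROC curve, and the area under it is the AUC used for model comparison.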

Test of Significance
- Given two models:
  - Model M1: accuracy = 85%, tested on 30 instances;
  - Model M2: accuracy = 75%, tested on 5000 instances.
- Can we say M1 is better than M2?
- How much confidence can we place on the accuracy of M1 and M2?
- Can the difference in the performance measure be explained as a result of random fluctuations in the test set?

Confidence Interval for Accuracy
- Prediction can be regarded as a Bernoulli trial:
  - a Bernoulli trial has 2 possible outcomes;
  - the possible outcomes for a prediction are correct or wrong.
- A collection of Bernoulli trials has a Binomial distribution: x ~ Bin(N, p), where x is the number of correct predictions.
  - e.g.: toss a fair coin 50 times; how many heads would turn up? Expected number of heads = N * p = 50 * 0.5 = 25.
- Given x (the number of correct predictions), or equivalently acc = x/N, and N (the number of test instances), can we predict p (the true accuracy of the model)?

Confidence Interval for Accuracy
- For large test sets (N > 30), acc has a normal distribution with mean p and variance p(1 - p)/N:
  P( -Z_{alpha/2} <= (acc - p) / sqrt(p(1 - p)/N) <= Z_{1-alpha/2} ) = 1 - alpha
  (figure: the standard normal density with area 1 - alpha between -Z_{alpha/2} and Z_{1-alpha/2})
- Confidence interval for p:
  p = ( 2*N*acc + Z_{alpha/2}^2 ± Z_{alpha/2} * sqrt( Z_{alpha/2}^2 + 4*N*acc - 4*N*acc^2 ) ) / ( 2*(N + Z_{alpha/2}^2) )

Confidence Interval for Accuracy
- Consider a model that produces an accuracy of 80% when evaluated on 100 test instances: N = 100, acc = 0.8.
- Let 1 - alpha = 0.95 (95% confidence).
- From the probability table, Z_{alpha/2} = 1.96.

Comparing Performance of 2 Models
- Given two models, say M1 and M2, ...
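A minimal sketch that evaluates the confidence-interval formula above; the function name is an assumption, and z defaults to 1.96 for the 95% level used in the example.

```python
# Minimal sketch: confidence interval for the true accuracy p, given the observed
# accuracy acc on N test instances, using
# p = (2*N*acc + z^2 +/- z*sqrt(z^2 + 4*N*acc - 4*N*acc^2)) / (2*(N + z^2)).
import math

def accuracy_confidence_interval(acc, n, z=1.96):   # z = Z_{alpha/2} for 95% confidence
    center = 2 * n * acc + z * z
    spread = z * math.sqrt(z * z + 4 * n * acc - 4 * n * acc * acc)
    denom = 2 * (n + z * z)
    return (center - spread) / denom, (center + spread) / denom

# The example above: acc = 0.8, N = 100, 95% confidence.
lo, hi = accuracy_confidence_interval(0.8, 100)
print(round(lo, 3), round(hi, 3))   # -> approximately 0.711 0.867
```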
