
Lesson 4: Data Classification and Prediction

Contents
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- References

Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data (constructs a model) based on the training set and the values (class labels) of a classifying attribute, and uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications: credit approval, target marketing, medical diagnosis, fraud detection

Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the classification produced by the model
    - The accuracy rate is the percentage of test-set samples that are correctly classified by the model
    - The test set is independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction
- A classification algorithm is applied to the training data to produce the classifier (model), e.g. the rule
  IF rank = "professor" OR years > 6 THEN tenured = "yes"

Classification Process (2): Use the Model in Prediction
- The classifier is applied first to test data, then to unseen data, e.g. (Jeff, Professor, 4): tenured?
- (A small code sketch of these two steps follows below.)
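As a concrete illustration of the two-step process, the sketch below builds a model on a training set and then estimates its accuracy on an independent test set before using it on new data. Python with scikit-learn and its bundled iris data are assumptions made for this example; they are not part of the lecture.

```python
# Minimal sketch of the two-step classification process (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on the training set (class labels are known).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set,
# then classify new (unseen) tuples if the accuracy is acceptable.
accuracy = accuracy_score(y_test, model.predict(X_test))
print(f"test-set accuracy: {accuracy:.2f}")
new_tuple = [[5.0, 3.4, 1.5, 0.2]]          # an unseen sample
print("predicted class:", model.predict(new_tuple)[0])
```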

Supervised vs. Unsupervised Learning
- Supervised learning (classification)
  - Supervision: the training data (observations, measurements, etc.) are accompanied by labels indicating the class of the observations
  - New data are classified based on the training set
- Unsupervised learning (clustering)
  - The class labels of the training data are unknown
  - Given a set of measurements, observations, etc., the aim is to establish the existence of classes or clusters in the data

Issues Regarding Classification and Prediction (1): Data Preparation
- Data cleaning: preprocess the data in order to reduce noise and handle missing values
- Relevance analysis (feature selection): remove irrelevant or redundant attributes
- Data transformation: generalize and/or normalize the data

Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy
- Speed and scalability
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency on disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures, e.g., goodness of rules, such as decision tree size or compactness of classification rules

Decision Tree Induction: Training Dataset
- This follows an example of Quinlan's ID3 (Playing Tennis)
- The training table here is the buys_computer data with attributes age, income, student, and credit_rating

Output: A Decision Tree for "buys_computer"
- Root test: age?
  - age <= 30: test student? (no -> buys_computer = "no"; yes -> buys_computer = "yes")
  - age 31..40: buys_computer = "yes"
  - age > 40: test credit_rating? (excellent -> buys_computer = "no"; fair -> buys_computer = "yes")

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed for classifying the leaf
  - There are no samples left
- (A minimal sketch of this recursion is given below.)
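The basic greedy algorithm can be sketched in a few lines. The list-of-rows data layout and the pluggable choose_attribute scoring function (a stand-in for the information-gain measure introduced on the next slides) are illustrative assumptions, not the lecture's own code.

```python
from collections import Counter

def majority_class(rows, label):
    """Majority voting over the class labels of the remaining tuples."""
    return Counter(r[label] for r in rows).most_common(1)[0][0]

def build_tree(rows, attributes, label, choose_attribute):
    """Top-down, recursive, divide-and-conquer tree construction."""
    labels = [r[label] for r in rows]
    if len(set(labels)) == 1:             # all samples belong to the same class
        return labels[0]
    if not attributes:                    # no attributes left: majority voting
        return majority_class(rows, label)
    best = choose_attribute(rows, attributes, label)   # greedy attribute selection
    tree = {best: {}}
    for value in set(r[best] for r in rows):           # partition on the chosen attribute
        subset = [r for r in rows if r[best] == value]
        remaining = [a for a in attributes if a != best]
        tree[best][value] = build_tree(subset, remaining, label, choose_attribute)
    return tree
```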

Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let the set S contain si tuples of class Ci, for i = 1, …, m, with s = s1 + … + sm
- Expected information (entropy) needed to classify an arbitrary tuple:
  I(s1, …, sm) = - sum over i of (si / s) * log2(si / s)
- Entropy of attribute A with values {a1, a2, …, av}, where sij tuples of class Ci fall in the subset with A = aj:
  E(A) = sum over j of ((s1j + … + smj) / s) * I(s1j, …, smj)
- Information gained by branching on attribute A:
  Gain(A) = I(s1, …, sm) - E(A)

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age (and likewise for the other attributes), then branch on the attribute with the highest gain
- For a continuous-valued attribute, candidate tests take the form "A <= split-point", and the best split point is chosen in the same way
- (A small sketch of this computation follows.)
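A hedged sketch of the computation in plain Python: info reproduces I(9, 5) = 0.940 from the slide, and gain evaluates the information gained by branching on an attribute. The branch class distributions used for age, (2, 3), (4, 0) and (3, 2), come from the standard buys_computer example and are assumptions here rather than something spelled out on this slide.

```python
import math

def info(*counts):
    """I(s1, ..., sm): expected information needed to classify a tuple."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def gain(class_counts, partitions):
    """Gain(A) = I(s1, ..., sm) - E(A) for an attribute whose branches have
    the class distributions listed in `partitions`."""
    total = sum(class_counts)
    entropy_a = sum(sum(p) / total * info(*p) for p in partitions)
    return info(*class_counts) - entropy_a

print(round(info(9, 5), 3))                               # 0.94
print(round(gain((9, 5), [(2, 3), (4, 0), (3, 2)]), 3))   # gain of branching on age
```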

Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms one conjunct; the leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (one rule per path of the buys_computer tree above):
  - IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  - IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  - IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"
- (A small sketch that generates such rules from a tree follows.)
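Using the nested-dictionary tree representation from the induction sketch above (an illustrative format, not the slides' own), one IF-THEN rule per root-to-leaf path can be produced by a simple traversal:

```python
def extract_rules(tree, conditions=()):
    """Yield one (conditions, class) rule per root-to-leaf path."""
    if not isinstance(tree, dict):                      # a leaf holds the class prediction
        yield conditions, tree
        return
    (attribute, branches), = tree.items()
    for value, subtree in branches.items():
        # each attribute-value pair along the path forms one conjunct
        yield from extract_rules(subtree, conditions + ((attribute, value),))

# A hand-written tree in the sketch's format (matching the buys_computer tree above):
toy_tree = {"age": {"<=30": {"student": {"no": "no", "yes": "yes"}},
                    "31..40": "yes",
                    ">40": {"credit_rating": {"excellent": "no", "fair": "yes"}}}}
for conds, label in extract_rules(toy_tree):
    body = " AND ".join(f'{a} = "{v}"' for a, v in conds)
    print(f'IF {body} THEN buys_computer = "{label}"')
```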

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - It is difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross-validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node will improve the entire distribution

Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification is a classical problem, extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction in data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods

Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning, stopping tree growth earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Presentation of Classification Results
- Example screenshots: visualization of a decision tree in SGI/MineSet 3.0; interactive visual mining by Perception-Based Classification (PBC)

Bayesian Classification: Why?
- Probabilistic learning: calculate explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predict multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects the background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing the sample X, given that the hypothesis holds

Bayesian Theorem
- Given the data sample X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
- Informally: posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis: the hypothesis h that maximizes P(h|X), equivalently the h that maximizes P(X|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities, and carries significant computational cost

Naïve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent given the class
- The joint probability of, say, two attribute values y1 and y2, given that the class is C, is the product of their individual probabilities given that class:
  P(y1, y2 | C) = P(y1 | C) * P(y2 | C)
- No dependence relations between attributes are modeled
- This greatly reduces the computation cost: only the class distributions need to be counted
- Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) * P(Ci)

Training Dataset
- Classes: C1: buys_computer = "yes"; C2: buys_computer = "no"
- The training table is the 14-tuple buys_computer data used earlier (9 "yes" tuples and 5 "no" tuples)
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)

Naïve Bayesian Classifier: An Example
- Compute P(X|Ci) for each class:
  - P(age <= 30 | buys_computer = "yes") = 2/9 = 0.222
  - P(age <= 30 | buys_computer = "no") = 3/5 = 0.6
  - P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  - P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  - P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  - P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  - P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  - P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- For X = (age <= 30, income = medium, student = yes, credit_rating = fair):
  - P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  - P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- Multiply by the priors P(Ci):
  - P(X | buys_computer = "yes") * P(buys_computer = "yes") = 0.028
  - P(X | buys_computer = "no") * P(buys_computer = "no") = 0.007
- Therefore, X belongs to class buys_computer = "yes"
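The arithmetic above can be checked with a short script. The conditional probabilities are taken from the slide; the priors P(yes) = 9/14 and P(no) = 5/14 follow from the 9 "yes" and 5 "no" training tuples and are stated here as an inference, not copied from the slide.

```python
# Naive Bayesian computation for X = (age<=30, income=medium, student=yes, credit_rating=fair)
cond = {
    "yes": {"age<=30": 2/9, "income=medium": 4/9, "student=yes": 6/9, "credit=fair": 6/9},
    "no":  {"age<=30": 3/5, "income=medium": 2/5, "student=yes": 1/5, "credit=fair": 2/5},
}
prior = {"yes": 9/14, "no": 5/14}

for c in ("yes", "no"):
    likelihood = 1.0
    for p in cond[c].values():
        likelihood *= p                      # P(X|Ci) under conditional independence
    score = likelihood * prior[c]            # proportional to the posterior P(Ci|X)
    print(c, round(likelihood, 3), round(score, 3))
# yes 0.044 0.028   -> X is assigned to buys_computer = "yes"
# no 0.019 0.007
```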

Naïve Bayesian Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - Assumes class-conditional independence, which can cause a loss of accuracy
  - In practice, dependencies exist among variables
    - E.g., hospital patients: profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.)
    - Such dependencies cannot be modeled by the naïve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks
- A Bayesian belief network allows a subset of the variables to be conditionally independent
- A graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution:
    P(x1, …, xn) = product over i of P(xi | Parents(xi))
- Example structure with nodes X, Y, Z, P:
  - Nodes are random variables; links express dependency
  - X and Y are the parents of Z, and Y is the parent of P
  - There is no direct dependency between Z and P
  - The graph has no loops or cycles

Bayesian Belief Network: An Example
- Variables: FamilyHistory, Smoker, LungCancer, Emphysema, PositiveXRay, Dyspnea
- The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of the values of its parents (FamilyHistory, Smoker); e.g. one column lists 0.1 for LungCancer and 0.9 for its complement
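A belief network specifies the joint distribution as a product of one factor per node conditioned on its parents. The sketch below encodes the four-node X, Y, Z, P structure from the slide above; every probability value in it is invented purely for illustration.

```python
# X and Y are the parents of Z; Y is the parent of P.  All numbers are made up.
p_x = {True: 0.3, False: 0.7}                      # P(X)
p_y = {True: 0.6, False: 0.4}                      # P(Y)
p_z_true = {(True, True): 0.9, (True, False): 0.5, # P(Z = true | X, Y), the CPT of Z
            (False, True): 0.4, (False, False): 0.1}
p_p_true = {True: 0.7, False: 0.2}                 # P(P = true | Y), the CPT of P

def joint(x, y, z, p):
    """P(X=x, Y=y, Z=z, P=p) = P(x) * P(y) * P(z | x, y) * P(p | y)."""
    pz = p_z_true[(x, y)] if z else 1 - p_z_true[(x, y)]
    pp = p_p_true[y] if p else 1 - p_p_true[y]
    return p_x[x] * p_y[y] * pz * pp

print(round(joint(True, True, True, False), 4))   # 0.3 * 0.6 * 0.9 * 0.3 = 0.0486
```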

Learning Bayesian Networks
- Several cases:
  - Both the network structure and all variables are observable: learn only the CPTs
  - Network structure known, some variables hidden: method of gradient descent, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all variables hidden: no good algorithms are known for this purpose
- See D. Heckerman, "Bayesian networks for data mining"

What Is Prediction?
- (Numerical) prediction is similar to classification:
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction is different from classification:
  - Classification predicts categorical class labels
  - Prediction models continuous-valued functions
- The major method for prediction is regression: model the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  - Linear and multiple regression
  - Non-linear regression
  - Other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees

Linear Regression
- Linear regression involves a response variable y and a single predictor variable x:
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are the regression coefficients
- Method of least squares: estimates the best-fitting straight line (see the sketch below)
- Multiple linear regression involves more than one predictor variable
  - Training data is of the form (X1, y1), (X2, y2), …, (X|D|, y|D|)
  - E.g., for 2-D data we may have: y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least-squares method or using packages such as SAS or S-Plus
  - Many nonlinear functions can be transformed into the above
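Least-squares estimation of w0 and w1 has a closed form: w1 = sum((xi - x_mean)(yi - y_mean)) / sum((xi - x_mean)^2) and w0 = y_mean - w1 * x_mean. The sketch below applies it to invented data points (the data, like the code, is illustrative and not from the lecture).

```python
def least_squares(xs, ys):
    """Fit y = w0 + w1 * x by the method of least squares."""
    n = len(xs)
    x_mean, y_mean = sum(xs) / n, sum(ys) / n
    w1 = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
          / sum((x - x_mean) ** 2 for x in xs))
    w0 = y_mean - w1 * x_mean
    return w0, w1

# Made-up training data, roughly following y = 2 + 3x:
xs = [1, 2, 3, 4, 5]
ys = [5.1, 7.9, 11.2, 13.8, 17.0]
w0, w1 = least_squares(xs, ys)
print(round(w0, 2), round(w1, 2))   # intercept and slope close to 2 and 3
```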

Nonlinear Regression
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model. For example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with the new variables x2 = x^2 and x3 = x^3:
  y = w0 + w1 x + w2 x2 + w3 x3
- Other functions, such as the power function, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms); it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae
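The transformation is mechanical: introduce the new variables x2 = x^2 and x3 = x^3 and solve the resulting multiple linear regression. A minimal sketch (NumPy and the generated data are assumptions for illustration, not part of the lecture):

```python
import numpy as np

# Data generated from an invented cubic: y = 1 + 2x - 0.5x^2 + 0.1x^3
x = np.array([-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0])
y = 1 + 2 * x - 0.5 * x**2 + 0.1 * x**3

# New variables x2 = x^2 and x3 = x^3 turn the cubic model into multiple linear regression.
design = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(design, y, rcond=None)
print(np.round(w, 3))   # recovers the coefficients [1.0, 2.0, -0.5, 0.1]
```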

Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression can be applied to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a function of a linear combination of the predictor variables (see the sketch below)
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees: trees that predict continuous values rather than class labels
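Logistic regression, mentioned above, keeps the linear combination of the predictors but passes it through the logistic (sigmoid) function so that the output is a probability between 0 and 1. The coefficients below are invented for illustration:

```python
import math

def event_probability(x, w0, w1):
    """Logistic model: P(event | x) = 1 / (1 + exp(-(w0 + w1 * x)))."""
    return 1.0 / (1.0 + math.exp(-(w0 + w1 * x)))

# Made-up coefficients: the log-odds of the event increase by 0.8 per unit of x.
for x in (0, 1, 2, 3):
    print(x, round(event_probability(x, w0=-2.0, w1=0.8), 3))
```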

Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction: the average value of the predicted attribute for the training tuples that reach the leaf
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when the data are not represented well by a simple linear model (a small sketch of the two kinds of leaves follows)

Predictive modeling: predict data values or construct generalized linear …
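As referenced above, the difference between the two kinds of leaves can be made concrete. The tuples and the model-tree coefficients below are invented for illustration (the linear equation is written out by hand rather than actually fitted):

```python
# Training tuples (x1, x2, target) that reach the same leaf -- invented values.
leaf_tuples = [(1.0, 5.0, 10.2), (1.2, 4.8, 10.9), (0.9, 5.3, 9.8)]

# Regression-tree leaf: stores the average value of the predicted attribute.
regression_leaf = sum(t[2] for t in leaf_tuples) / len(leaf_tuples)

# Model-tree leaf: stores a multivariate linear equation for the predicted attribute.
def model_leaf(x1, x2):
    return 2.0 + 1.5 * x1 + 1.3 * x2   # illustrative coefficients, not fitted here

print(round(regression_leaf, 2))        # same prediction for every query reaching the leaf
print(round(model_leaf(1.1, 5.0), 2))   # prediction depends on the query's attribute values
```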
