Lesson 4: Data Classification and Prediction

Outline
- What is classification? What is prediction?
- Issues regarding classification and prediction
- Classification by decision tree induction
- Bayesian classification
- Prediction
- Summary
- References

Classification vs. Prediction
- Classification
  - predicts categorical class labels (discrete or nominal)
  - classifies data: constructs a model based on the training set and the values (class labels) of a classifying attribute, then uses the model to classify new data
- Prediction
  - models continuous-valued functions, i.e., predicts unknown or missing values
- Typical applications
  - credit approval
  - target marketing
  - medical diagnosis
  - fraud detection
Classification: A Two-Step Process
- Model construction: describing a set of predetermined classes
  - Each tuple/sample is assumed to belong to a predefined class, as determined by the class label attribute
  - The set of tuples used for model construction is the training set
  - The model is represented as classification rules, decision trees, or mathematical formulae
- Model usage: classifying future or unknown objects
  - Estimate the accuracy of the model
    - The known label of each test sample is compared with the model's classification result
    - Accuracy rate is the percentage of test set samples that are correctly classified by the model
    - The test set must be independent of the training set; otherwise over-fitting will occur
  - If the accuracy is acceptable, use the model to classify data tuples whose class labels are not known

Classification Process (1): Model Construction
- [Figure: training data fed into a classification algorithm, producing a classifier (model), e.g., the rule IF rank = "professor" OR years > 6 THEN tenured = "yes"]
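The two-step process above maps directly onto a standard train/evaluate/apply workflow. Below is a minimal sketch, not part of the original slides, assuming scikit-learn is available; the dataset and the unseen tuple are illustrative only.

```python
# Minimal sketch of the two-step classification process (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Step 1: model construction on a training set (test set held out to avoid over-fitting).
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2: model usage -- estimate accuracy on the independent test set ...
print("accuracy:", accuracy_score(y_test, model.predict(X_test)))

# ... and, if acceptable, classify unseen tuples whose class labels are unknown.
print("predicted class:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```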
6、ank = professorOR years 6THEN tenured = yes Classifier(Model)5Classification Process (2): Use the Model in PredictionClassifierTestingDataUnseen Data(Jeff, Professor, 4)Tenured?6Supervised vs. Unsupervised LearningSupervised learning (classification)Supervision: The training data (observations, meas
7、urements, etc.) are accompanied by labels indicating the class of the observationsNew data is classified based on the training setUnsupervised learning (clustering)The class labels of training data is unknownGiven a set of measurements, observations, etc. with the aim of establishing the existence o
8、f classes or clusters in the data7Issues Regarding Classification and Prediction (1): Data PreparationData cleaningPreprocess data in order to reduce noise and handle missing valuesRelevance analysis (feature selection)Remove the irrelevant or redundant attributesData transformationGeneralize and/or
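As a concrete illustration of the cleaning and transformation steps (not from the original slides), a pure-Python sketch on made-up records:

```python
# Sketch of simple data preparation: fill missing values, then min-max normalize.
records = [{"age": 25, "income": 30000},
           {"age": None, "income": 48000},   # missing age value
           {"age": 41, "income": 52000}]

# Data cleaning: replace a missing value with the attribute mean.
ages = [r["age"] for r in records if r["age"] is not None]
mean_age = sum(ages) / len(ages)
for r in records:
    if r["age"] is None:
        r["age"] = mean_age

# Data transformation: min-max normalization of income to [0, 1].
lo = min(r["income"] for r in records)
hi = max(r["income"] for r in records)
for r in records:
    r["income_norm"] = (r["income"] - lo) / (hi - lo)

print(records)
```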
Issues Regarding Classification and Prediction (2): Evaluating Classification Methods
- Accuracy: classifier accuracy and predictor accuracy
- Speed and scalability
  - time to construct the model (training time)
  - time to use the model (classification/prediction time)
- Robustness: handling noise and missing values
- Scalability: efficiency for disk-resident databases
- Interpretability: understanding and insight provided by the model
- Other measures: e.g., goodness of rules, such as decision tree size or compactness of classification rules

Decision Tree Induction: Training Dataset
- This follows an example in the style of Quinlan's ID3 (originally "Playing Tennis"); here the class label is buys_computer
- Training set (the table image was lost in extraction; reconstructed consistently with the counts used in the examples below):

  age      income  student  credit_rating  buys_computer
  <=30     high    no       fair           no
  <=30     high    no       excellent      no
  31...40  high    no       fair           yes
  >40      medium  no       fair           yes
  >40      low     yes      fair           yes
  >40      low     yes      excellent      no
  31...40  low     yes      excellent      yes
  <=30     medium  no       fair           no
  <=30     low     yes      fair           yes
  >40      medium  yes      fair           yes
  <=30     medium  yes      excellent      yes
  31...40  medium  no       excellent      yes
  31...40  high    yes      fair           yes
  >40      medium  no       excellent      no
Output: A Decision Tree for "buys_computer"
- [Figure: the induced tree. Root node tests age?; the <=30 branch leads to a student? test (no -> class "no", yes -> class "yes"); the 31...40 branch is a leaf with class "yes"; the >40 branch leads to a credit_rating? test (excellent -> class "no", fair -> class "yes")]

Algorithm for Decision Tree Induction
- Basic algorithm (a greedy algorithm)
  - The tree is constructed in a top-down, recursive, divide-and-conquer manner
  - At the start, all the training examples are at the root
  - Attributes are categorical (continuous-valued attributes are discretized in advance)
  - Examples are partitioned recursively based on selected attributes
  - Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain)
- Conditions for stopping partitioning
  - All samples for a given node belong to the same class
  - There are no remaining attributes for further partitioning; majority voting is employed to label the leaf
  - There are no samples left
- (A sketch of this recursion follows below)
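The following is a compact, pure-Python sketch of this greedy recursion, not the original pseudocode from the slides; it uses information gain (defined on the next slide) as the attribute selection measure.

```python
# Sketch of greedy top-down decision tree induction (ID3-style).
from collections import Counter
from math import log2

def entropy(labels):
    """Expected information needed to classify a sample drawn from `labels`."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def info_gain(rows, labels, attr):
    """Reduction in entropy obtained by partitioning on attribute `attr`."""
    n = len(labels)
    e = 0.0
    for v in set(r[attr] for r in rows):
        subset = [lab for r, lab in zip(rows, labels) if r[attr] == v]
        e += len(subset) / n * entropy(subset)
    return entropy(labels) - e

def build_tree(rows, labels, attrs):
    if len(set(labels)) == 1:            # all samples belong to one class
        return labels[0]
    if not attrs:                        # no attributes left: majority vote
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, labels, a))
    node = {}
    for v in set(r[best] for r in rows):  # partition recursively on `best`
        part = [(r, lab) for r, lab in zip(rows, labels) if r[best] == v]
        node[(best, v)] = build_tree([r for r, _ in part],
                                     [lab for _, lab in part],
                                     [a for a in attrs if a != best])
    return node

if __name__ == "__main__":
    rows = [{"student": "no"}, {"student": "yes"}, {"student": "yes"}]
    print(build_tree(rows, ["no", "yes", "yes"], ["student"]))
```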
Attribute Selection Measure: Information Gain (ID3/C4.5)
- Select the attribute with the highest information gain
- Let S contain si tuples of class Ci, for i = 1, ..., m
  - Expected information required to classify an arbitrary tuple:
    I(s1, ..., sm) = -SUM_{i=1..m} (si/s) log2(si/s)
  - Entropy of attribute A with values {a1, a2, ..., av}:
    E(A) = SUM_{j=1..v} ((s1j + ... + smj)/s) * I(s1j, ..., smj)
  - Information gained by branching on attribute A:
    Gain(A) = I(s1, ..., sm) - E(A)

Attribute Selection by Information Gain Computation
- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples)
- I(p, n) = I(9, 5) = 0.940
- Compute the entropy for age:

  age      p_i  n_i  I(p_i, n_i)
  <=30     2    3    0.971
  31...40  4    0    0
  >40      3    2    0.971

  E(age) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
  Gain(age) = I(9,5) - E(age) = 0.940 - 0.694 = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048; age is selected as the splitting attribute
- For a continuous-valued attribute, a binary test of the form "age <= split-point" is used instead
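A quick numeric check of these figures (not in the original slides), using the age distribution from the training table above:

```python
# Verify I(9,5), E(age), and Gain(age) for the buys_computer training set.
from math import log2

def info(*counts):
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c)

i_total = info(9, 5)
e_age = (5/14) * info(2, 3) + (4/14) * info(4, 0) + (5/14) * info(3, 2)
print(round(i_total, 3))           # 0.94
print(round(i_total - e_age, 3))   # Gain(age) = 0.246
```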
Extracting Classification Rules from Trees
- Represent the knowledge in the form of IF-THEN rules
- One rule is created for each path from the root to a leaf
- Each attribute-value pair along a path forms a conjunction
- The leaf node holds the class prediction
- Rules are easier for humans to understand
- Example (one rule per path of the tree above; a sketch that mechanizes this extraction follows below)
  - IF age = "<=30" AND student = "no" THEN buys_computer = "no"
  - IF age = "<=30" AND student = "yes" THEN buys_computer = "yes"
  - IF age = "31...40" THEN buys_computer = "yes"
  - IF age = ">40" AND credit_rating = "excellent" THEN buys_computer = "no"
  - IF age = ">40" AND credit_rating = "fair" THEN buys_computer = "yes"

Avoid Overfitting in Classification
- Overfitting: an induced tree may overfit the training data
  - Too many branches, some of which may reflect anomalies due to noise or outliers
  - Poor accuracy for unseen samples
- Two approaches to avoid overfitting
  - Prepruning: halt tree construction early; do not split a node if this would result in the goodness measure falling below a threshold
    - Difficult to choose an appropriate threshold
  - Postpruning: remove branches from a "fully grown" tree
    - Produces a sequence of progressively pruned trees
    - Use a set of data different from the training data to decide which is the "best pruned tree"

Approaches to Determine the Final Tree Size
- Separate training (2/3) and testing (1/3) sets
- Use cross validation
- Use all the data for training, but apply a statistical test (e.g., chi-square) to estimate whether expanding or pruning a node will improve the distribution
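To make the path-to-rule mapping concrete, here is a small sketch (not from the slides) that walks a nested-dict tree in the representation produced by the induction sketch above and prints one IF-THEN rule per root-to-leaf path:

```python
# Walk a decision tree and emit one IF-THEN rule per root-to-leaf path.
tree = {("age", "<=30"): {("student", "no"): "no", ("student", "yes"): "yes"},
        ("age", "31...40"): "yes",
        ("age", ">40"): {("credit_rating", "excellent"): "no",
                         ("credit_rating", "fair"): "yes"}}

def extract_rules(node, conjuncts=()):
    if not isinstance(node, dict):              # leaf: holds the class prediction
        cond = " AND ".join(f'{a} = "{v}"' for a, v in conjuncts)
        print(f'IF {cond} THEN buys_computer = "{node}"')
        return
    for (attr, value), child in node.items():   # each pair extends the conjunction
        extract_rules(child, conjuncts + ((attr, value),))

extract_rules(tree)
```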
Enhancements to Basic Decision Tree Induction
- Allow for continuous-valued attributes
  - Dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals
- Handle missing attribute values
  - Assign the most common value of the attribute
  - Assign a probability to each of the possible values
- Attribute construction
  - Create new attributes based on existing ones that are sparsely represented
  - This reduces fragmentation, repetition, and replication

Classification in Large Databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed
- Why decision tree induction for data mining?
  - relatively fast learning speed (compared with other classification methods)
  - convertible to simple and easy-to-understand classification rules
  - can use SQL queries for accessing databases
  - classification accuracy comparable with other methods
Scalable Decision Tree Induction Methods
- SLIQ (EDBT'96, Mehta et al.): builds an index for each attribute; only the class list and the current attribute list reside in memory
- SPRINT (VLDB'96, J. Shafer et al.): constructs an attribute-list data structure
- PUBLIC (VLDB'98, Rastogi & Shim): integrates tree splitting and tree pruning; stops growing the tree earlier
- RainForest (VLDB'98, Gehrke, Ramakrishnan & Ganti): separates the scalability aspects from the criteria that determine the quality of the tree; builds an AVC-list (attribute, value, class label)

Presentation of Classification Results
- [Figure: screenshot of a classification-results display]

Visualization of a Decision Tree in SGI/MineSet 3.0
- [Figure: decision tree visualization in SGI/MineSet 3.0]

Interactive Visual Mining by Perception-Based Classification (PBC)
- [Figure: PBC interactive visual mining interface]
Bayesian Classification: Why?
- Probabilistic learning: calculates explicit probabilities for hypotheses; among the most practical approaches to certain types of learning problems
- Incremental: each training example can incrementally increase or decrease the probability that a hypothesis is correct; prior knowledge can be combined with observed data
- Probabilistic prediction: predicts multiple hypotheses, weighted by their probabilities
- Standard: even when Bayesian methods are computationally intractable, they provide a standard of optimal decision making against which other methods can be measured

Bayesian Theorem: Basics
- Let X be a data sample whose class label is unknown
- Let H be the hypothesis that X belongs to class C
- For classification problems, determine P(H|X): the probability that the hypothesis holds given the observed data sample X
- P(H): prior probability of hypothesis H (the initial probability before we observe any data; reflects background knowledge)
- P(X): probability that the sample data is observed
- P(X|H): probability of observing sample X, given that the hypothesis holds

Bayesian Theorem
- Given training data X, the posterior probability of a hypothesis H, P(H|X), follows Bayes' theorem:
  P(H|X) = P(X|H) P(H) / P(X)
- Informally: posterior = likelihood x prior / evidence
- MAP (maximum a posteriori) hypothesis:
  h_MAP = argmax_h P(h|X) = argmax_h P(X|h) P(h)
- Practical difficulty: requires initial knowledge of many probabilities, at significant computational cost
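As a quick worked instance of the formula (not in the slides), using the likelihoods computed in the na?ve Bayes example later in this lesson:

```python
# Bayes' theorem on the buys_computer example: posterior is proportional
# to likelihood * prior; dividing by the evidence P(X) normalizes it.
likelihood = {"yes": 0.044, "no": 0.019}   # P(X|C_i), computed later in this lesson
prior = {"yes": 9 / 14, "no": 5 / 14}      # P(C_i) from the training set

score = {c: likelihood[c] * prior[c] for c in prior}
evidence = sum(score.values())             # P(X)
posterior = {c: score[c] / evidence for c in score}
print(score)       # {'yes': ~0.028, 'no': ~0.007}
print(posterior)   # normalized: X is classified as 'yes'
```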
Na?ve Bayes Classifier
- A simplifying assumption: attributes are conditionally independent given the class
- The probability of observing, say, two attribute values y1 and y2 together, given class C, is the product of the individual conditional probabilities:
  P(y1, y2 | C) = P(y1 | C) x P(y2 | C)
- No dependence relations between attributes are modeled
- Greatly reduces the computation cost: only the per-class value distributions need to be counted
- Once P(X|Ci) is known, assign X to the class with maximum P(X|Ci) x P(Ci)

Training Dataset
- Class C1: buys_computer = "yes"; class C2: buys_computer = "no"
- The training tuples are those in the table shown earlier
- Data sample X = (age <= 30, income = medium, student = yes, credit_rating = fair)
Na?ve Bayesian Classifier: An Example
- Compute P(X|Ci) for each class:
  - P(age = "<=30" | buys_computer = "yes") = 2/9 = 0.222
  - P(age = "<=30" | buys_computer = "no") = 3/5 = 0.6
  - P(income = "medium" | buys_computer = "yes") = 4/9 = 0.444
  - P(income = "medium" | buys_computer = "no") = 2/5 = 0.4
  - P(student = "yes" | buys_computer = "yes") = 6/9 = 0.667
  - P(student = "yes" | buys_computer = "no") = 1/5 = 0.2
  - P(credit_rating = "fair" | buys_computer = "yes") = 6/9 = 0.667
  - P(credit_rating = "fair" | buys_computer = "no") = 2/5 = 0.4
- X = (age <= 30, income = medium, student = yes, credit_rating = fair)
- P(X|Ci):
  - P(X | buys_computer = "yes") = 0.222 x 0.444 x 0.667 x 0.667 = 0.044
  - P(X | buys_computer = "no") = 0.6 x 0.4 x 0.2 x 0.4 = 0.019
- P(X|Ci) x P(Ci):
  - P(X | buys_computer = "yes") x P(buys_computer = "yes") = 0.044 x 9/14 = 0.028
  - P(X | buys_computer = "no") x P(buys_computer = "no") = 0.019 x 5/14 = 0.007
- Therefore, X belongs to class buys_computer = "yes"
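A compact sketch (not from the slides) that reproduces this computation directly from the training table:

```python
# Naive Bayes on the buys_computer training set: count, multiply, compare.
from collections import Counter

data = [  # (age, income, student, credit_rating, buys_computer)
    ("<=30", "high", "no", "fair", "no"), ("<=30", "high", "no", "excellent", "no"),
    ("31...40", "high", "no", "fair", "yes"), (">40", "medium", "no", "fair", "yes"),
    (">40", "low", "yes", "fair", "yes"), (">40", "low", "yes", "excellent", "no"),
    ("31...40", "low", "yes", "excellent", "yes"), ("<=30", "medium", "no", "fair", "no"),
    ("<=30", "low", "yes", "fair", "yes"), (">40", "medium", "yes", "fair", "yes"),
    ("<=30", "medium", "yes", "excellent", "yes"), ("31...40", "medium", "no", "excellent", "yes"),
    ("31...40", "high", "yes", "fair", "yes"), (">40", "medium", "no", "excellent", "no"),
]
x = ("<=30", "medium", "yes", "fair")          # unseen sample X

class_count = Counter(row[-1] for row in data)
scores = {}
for c, n_c in class_count.items():
    likelihood = 1.0
    for i, value in enumerate(x):              # P(X|C) = product of per-attribute probs
        match = sum(1 for row in data if row[-1] == c and row[i] == value)
        likelihood *= match / n_c
    scores[c] = likelihood * n_c / len(data)   # multiply by the prior P(C)

print(scores)                                  # {'yes': ~0.028, 'no': ~0.007}
print(max(scores, key=scores.get))             # 'yes'
```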
Na?ve Bayesian Classifier: Comments
- Advantages
  - Easy to implement
  - Good results obtained in most cases
- Disadvantages
  - The class-conditional independence assumption causes a loss of accuracy, because in practice dependencies exist among variables
  - E.g., in hospital data: patient profile (age, family history, etc.), symptoms (fever, cough, etc.), disease (lung cancer, diabetes, etc.); dependencies among these cannot be modeled by a na?ve Bayesian classifier
- How to deal with these dependencies? Bayesian belief networks

Bayesian Belief Networks
- A Bayesian belief network allows conditional independencies among a subset of the variables
- A graphical model of causal relationships
  - Represents dependencies among the variables
  - Gives a specification of the joint probability distribution:
    P(x1, ..., xn) = PROD_{i=1..n} P(xi | Parents(xi))
- [Figure: a directed acyclic graph with nodes X, Y, Z, P; nodes are random variables, links are dependencies; X and Y are the parents of Z, and Y is the parent of P; there is no dependency between Z and P; the graph has no loops or cycles]

Bayesian Belief Network: An Example
- [Figure: network in which FamilyHistory and Smoker are parents of LungCancer and Emphysema; PositiveXRay depends on LungCancer; Dyspnea depends on LungCancer and Emphysema]
- The conditional probability table (CPT) for the variable LungCancer shows the conditional probability for each possible combination of its parents:

         (FH,S)  (FH,~S)  (~FH,S)  (~FH,~S)
  LC     0.8     0.5      0.7      0.1
  ~LC    0.2     0.5      0.3      0.9
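A small sketch (not from the slides) of how the CPTs combine into a joint probability; the LungCancer CPT follows the table above, while the prior and X-ray probabilities are hypothetical placeholders:

```python
# Joint probability in a belief network: P(x1..xn) = prod_i P(xi | Parents(xi)).
p_fh = 0.15            # P(FamilyHistory)  -- hypothetical prior
p_s = 0.30             # P(Smoker)         -- hypothetical prior
p_lc = {               # P(LungCancer | FamilyHistory, Smoker), from the CPT above
    (True, True): 0.8, (True, False): 0.5,
    (False, True): 0.7, (False, False): 0.1,
}
p_xray = {True: 0.9, False: 0.2}   # P(PositiveXRay | LungCancer) -- hypothetical

# P(FH=T, S=F, LC=T, XRay=T) as a product of local conditionals:
joint = p_fh * (1 - p_s) * p_lc[(True, False)] * p_xray[True]
print(joint)           # 0.15 * 0.7 * 0.5 * 0.9 = 0.04725
```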
Learning Bayesian Networks
- Several cases
  - Both the network structure and all variables are observable: learn only the CPTs
  - Network structure known, some variables hidden: use gradient descent, analogous to neural network learning
  - Network structure unknown, all variables observable: search through the model space to reconstruct the graph topology
  - Unknown structure, all variables hidden: no good algorithms are known for this case
- Reference: D. Heckerman, "Bayesian networks for data mining"
What Is Prediction?
- (Numerical) prediction is similar to classification
  - construct a model
  - use the model to predict a continuous or ordered value for a given input
- Prediction differs from classification
  - classification predicts a categorical class label
  - prediction models continuous-valued functions
- Major method for prediction: regression
  - models the relationship between one or more independent (predictor) variables and a dependent (response) variable
- Regression analysis
  - linear and multiple regression
  - non-linear regression
  - other regression methods: generalized linear models, Poisson regression, log-linear models, regression trees
Linear Regression
- Linear regression involves a response variable y and a single predictor variable x:
  y = w0 + w1 x
  where w0 (y-intercept) and w1 (slope) are the regression coefficients
- The method of least squares estimates the best-fitting straight line:
  w1 = SUM_i (xi - mean(x))(yi - mean(y)) / SUM_i (xi - mean(x))^2,  w0 = mean(y) - w1 mean(x)
  (a sketch using these estimates follows below)
- Multiple linear regression involves more than one predictor variable
  - Training data is of the form (X1, y1), (X2, y2), ..., (X|D|, y|D|)
  - E.g., for 2-D data we may have: y = w0 + w1 x1 + w2 x2
  - Solvable by an extension of the least squares method, or with software such as SAS or S-Plus
  - Many nonlinear functions can be transformed into the above
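A minimal least-squares fit in pure Python (not from the slides), applying the closed-form estimates above to made-up points:

```python
# Least-squares fit of y = w0 + w1*x on a toy dataset.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.9, 4.1, 6.0, 8.1, 9.9]              # roughly y = 2x

mx = sum(xs) / len(xs)
my = sum(ys) / len(ys)
num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
den = sum((x - mx) ** 2 for x in xs)
w1 = num / den
w0 = my - w1 * mx
print(f"y = {w0:.3f} + {w1:.3f} x")         # slope near 2, intercept near 0
print("prediction at x = 6:", w0 + w1 * 6)
```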
Nonlinear Regression
- Some nonlinear models can be modeled by a polynomial function
- A polynomial regression model can be transformed into a linear regression model; for example,
  y = w0 + w1 x + w2 x^2 + w3 x^3
  is convertible to linear form with the new variables x2 = x^2, x3 = x^3:
  y = w0 + w1 x + w2 x2 + w3 x3
- Other functions, such as the power function, can also be transformed into a linear model
- Some models are intractably nonlinear (e.g., a sum of exponential terms); it may still be possible to obtain least-squares estimates through extensive calculation on more complex formulae
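A sketch of this transformation (not from the slides), assuming NumPy is available: expand x into the derived columns, then solve the resulting linear system by least squares:

```python
# Fit y = w0 + w1*x + w2*x^2 + w3*x^3 by treating x^2 and x^3 as new variables.
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = 1 + 2 * x - 0.5 * x**2 + 0.1 * x**3      # synthetic data with known weights

# Design matrix with columns [1, x, x2=x^2, x3=x^3]: now an ordinary linear model.
A = np.column_stack([np.ones_like(x), x, x**2, x**3])
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print(np.round(w, 3))                         # ~[1.0, 2.0, -0.5, 0.1]
```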
Other Regression-Based Models
- Generalized linear model: the foundation on which linear regression can be applied to modeling categorical response variables
  - The variance of y is a function of the mean value of y, not a constant
  - Logistic regression: models the probability of some event occurring as a linear function of a set of predictor variables
  - Poisson regression: models data that exhibit a Poisson distribution
- Log-linear models (for categorical data)
  - Approximate discrete multidimensional probability distributions
  - Also useful for data compression and smoothing
- Regression trees and model trees
  - Trees that predict continuous values rather than class labels
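As an illustration of the logistic case (not from the slides, assuming scikit-learn is available): the model learns a linear function of the predictors whose logistic transform gives the event probability:

```python
# Logistic regression: P(event) = 1 / (1 + exp(-(w0 + w1*x))).
from sklearn.linear_model import LogisticRegression

X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]   # single predictor
y = [0, 0, 0, 1, 1, 1]                           # binary event outcomes

clf = LogisticRegression().fit(X, y)
print(clf.predict_proba([[3.5]]))    # class probabilities near the boundary
print(clf.intercept_, clf.coef_)     # w0 and w1 of the underlying linear function
```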
Regression Trees and Model Trees
- Regression tree: proposed in the CART system (Breiman et al., 1984)
  - CART: Classification And Regression Trees
  - Each leaf stores a continuous-valued prediction: the average value of the predicted attribute for the training tuples that reach the leaf (illustrated in the sketch below)
- Model tree: proposed by Quinlan (1992)
  - Each leaf holds a regression model, a multivariate linear equation for the predicted attribute
  - A more general case than the regression tree
- Regression and model trees tend to be more accurate than linear regression when the data are not well represented by a simple linear model
- Predictive modeling: predict data values or construct generalized linear models
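A brief sketch (not from the slides, assuming scikit-learn is available) showing the regression-tree behavior described above: each leaf predicts the average target value of the training tuples that reach it:

```python
# Regression tree: leaves predict the mean target value of the tuples they hold.
from sklearn.tree import DecisionTreeRegressor

X = [[1], [2], [3], [10], [11], [12]]     # one predictor, two obvious groups
y = [1.0, 1.2, 1.1, 9.0, 9.2, 9.1]

tree = DecisionTreeRegressor(max_depth=1).fit(X, y)   # a single split
print(tree.predict([[2], [11]]))          # ~[1.1, 9.1]: the per-leaf averages
```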