
Data Mining: Concepts and Techniques

Slides for Textbook

Chapter 1
© Jiawei Han and Micheline Kamber
Department of Computer Science
University of Illinois at Urbana-Champaign
/~hanj
1/12/2023

Acknowledgements
  • This set of slides started with Han’s tutorial for UCLA Extension course in February 1998
  • Other subsequent contributors:
      • Dr. Hongjun Lu (Hong Kong Univ. of Science and Technology)
      • Graduate students from Simon Fraser Univ., Canada, notably Eugene Belchev, Jian Pei, and Osmar R. Zaiane
      • Graduate students from Univ. of Illinois at Urbana-Champaign

CS497JH Schedule (Fall 2002)
  • Chapter 1. Introduction {W1:L1}
  • Chapter 2. Data pre-processing {W4:L1-2}
      • Homework #1 distribution (SQL Server 2000)
  • Chapter 3. Data warehousing and OLAP technology for data mining {W2:L1-2, W3:L1-2}
      • Homework #2 distribution
  • Chapter 4. Data mining primitives, languages, and system architectures {W5:L1}
  • Chapter 5. Concept description: Characterization and comparison {W5:L2, W6:L1}
  • Chapter 6. Mining association rules in large databases {W6:L2, W7:L1-L21, W8:L1}
      • Homework #3 distribution
  • Chapter 7. Classification and prediction {W8:L2, W9:L2, W10:L1}
  • Midterm {W9:L1}
  • Chapter 8. Clustering analysis {W10:L2, W11:L1-2}
      • Homework #4 distribution
  • Chapter 9. Mining complex types of data {W12:L1-2, W13:L1-2}
  • Chapter 10. Data mining applications and trends in data mining {W14:L1}
  • Research/Development project presentation (W14-W15 + final exam period)
  • Final Project Due

Where to Find the Set of Slides?
  • Book page (MS PowerPoint files): /~hanj/dmbook
  • Updated course presentation slides (.ppt): /~cs497jh/
  • Research papers, DBMiner system, and other related information: /~hanj or dbminer

Chapter 1. Introduction
  • Motivation: Why data mining?
  • What is data mining?
  • Data Mining: On what kind of data?
  • Data mining functionality
  • Are all the patterns interesting?
  • Classification of data mining systems
  • Major issues in data mining

Necessity Is the Mother of Invention
  • Data explosion problem
      • Automated data collection tools and mature database technology lead to tremendous amounts of data accumulated and/or to be analyzed in databases, data warehouses, and other information repositories
  • We are drowning in data, but starving for knowledge!
  • Solution: Data warehousing and data mining
      • Data warehousing and on-line analytical processing
      • Mining interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology
  • 1960s: Data collection, database creation, IMS and network DBMS
  • 1970s: Relational data model, relational DBMS implementation
  • 1980s: RDBMS, advanced data models (extended-relational, OO, deductive, etc.)
      • Application-oriented DBMS (spatial, scientific, engineering, etc.)
  • 1990s: Data mining, data warehousing, multimedia databases, and Web databases
  • 2000s
      • Stream data management and mining
      • Data mining with a variety of applications
      • Web technology and global information systems

What Is Data Mining?
  • Data mining (knowledge discovery from data)
      • Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amount of data
      • Data mining: a misnomer?
  • Alternative names
      • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
  • Watch out: Is everything “data mining”?
      • (Deductive) query processing
      • Expert systems or small ML/statistical programs

Why Data Mining? Potential Applications
  • Data analysis and decision support
      • Market analysis and management
          • Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
      • Risk analysis and management
          • Forecasting, customer retention, improved underwriting, quality control, competitive analysis
      • Fraud detection and detection of unusual patterns (outliers)
  • Other Applications
      • Text mining (newsgroup, email, documents) and Web mining
      • Stream data mining
      • DNA and bio-data analysis

Market Analysis and Management
  • Where does the data come from?
      • Credit card transactions, loyalty cards, discount coupons, customer complaint calls, plus (public) lifestyle studies
  • Target marketing
      • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
      • Determine customer purchasing patterns over time
  • Cross-market analysis
      • Associations/co-relations between product sales, & prediction based on such association
  • Customer profiling
      • What types of customers buy what products (clustering or classification)
  • Customer requirement analysis
      • Identifying the best products for different customers
      • Predict what factors will attract new customers
  • Provision of summary information
      • Multidimensional summary reports
      • Statistical summary information (data central tendency and variation)

Corporate Analysis & Risk Management
  • Finance planning and asset evaluation
      • Cash flow analysis and prediction
      • Contingent claim analysis to evaluate assets
      • Cross-sectional and time series analysis (financial-ratio, trend analysis, etc.)
  • Resource planning
      • Summarize and compare the resources and spending
  • Competition
      • Monitor competitors and market directions
      • Group customers into classes and a class-based pricing procedure
      • Set pricing strategy in a highly competitive market

Fraud Detection & Mining Unusual Patterns
  • Approaches: Clustering & model construction for frauds, outlier analysis
  • Applications: Health care, retail, credit card service, telecomm.
      • Auto insurance: ring of collisions
      • Money laundering: suspicious monetary transactions
      • Medical insurance
          • Professional patients, ring of doctors, and ring of references
          • Unnecessary or correlated screening tests
      • Telecommunications: phone-call fraud
          • Phone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm
      • Retail industry
          • Analysts estimate that 38% of retail shrink is due to dishonest employees
      • Anti-terrorism

Other Applications
  • Sports
      • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
  • Astronomy
      • JPL and the Palomar Observatory discovered 22 quasars with the help of data mining
  • Internet Web Surf-Aid
      • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

Data Mining: A KDD Process
  • Data mining: core of knowledge discovery process
  [Figure labels: Data Cleaning; Data Integration; Databases; Data Warehouse; Knowledge; Task-relevant Data; Selection; Data Mining; Pattern Evaluation]

Steps of a KDD Process

  • Learning the application domain
      • Relevant prior knowledge and goals of application
  • Creating a target data set: data selection
  • Data cleaning and preprocessing (may take 60% of effort!)
  • Data reduction and transformation
      • Find useful features, dimensionality/variable reduction, invariant representation
  • Choosing functions of data mining
      • Summarization, classification, regression, association, clustering
  • Choosing the mining algorithm(s)
  • Data mining: search for patterns of interest
  • Pattern evaluation and knowledge presentation
      • Visualization, transformation, removing redundant patterns, etc.
  • Use of discovered knowledge

Data Mining and Business Intelligence
  [Figure: layered diagram with increasing potential to support business decisions. Roles: End User; Business Analyst; Data Analyst; DBA. Labels: Making Decisions; Data Presentation; Visualization Techniques; Data Mining; Information Discovery; Data Exploration; OLAP, MDA; Statistical Analysis, Querying and Reporting; Data Warehouses / Data Marts; Data Sources; Paper, Files, Information Providers, Database Systems, OLTP]

Architecture: Typical Data Mining System
  [Figure labels: Data Warehouse; Data cleaning & data integration; Filtering; Databases; Database or data warehouse server; Data mining engine; Pattern evaluation; Graphical user interface; Knowledge-base]

Data Mining: On What Kinds of Data?
  • Relational database
  • Data warehouse
  • Transactional database
  • Advanced database and information repository
      • Object-relational database
      • Spatial and temporal data
      • Time-series data
      • Stream data
      • Multimedia database
      • Heterogeneous and legacy database
      • Text databases & WWW

Data Mining Functionalities
  • Concept description: Characterization and discrimination
      • Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
  • Association (correlation and causality)
      • Diaper → Beer [0.5%, 75%]
  • Classification and Prediction
      • Construct models (functions) that describe and distinguish classes or concepts for future prediction
          • E.g., classify countries based on climate, or classify cars based on gas mileage
      • Presentation: decision-tree, classification rule, neural network
      • Predict some unknown or missing numerical values

Data Mining Functionalities (2)
  • Cluster analysis
      • Class label is unknown: Group data to form new classes, e.g., cluster houses to find distribution patterns
      • Maximizing intra-class similarity & minimizing interclass similarity
  • Outlier analysis
      • Outlier: a data object that does not comply with the general behavior of the data
      • Noise or exception? No! Useful in fraud detection, rare events analysis
  • Trend and evolution analysis
      • Trend and deviation: regression analysis
      • Sequential pattern mining, periodicity analysis
      • Similarity-based analysis
  • Other pattern-directed or statistical analyses

Are All the “Discovered” Patterns Interesting?
  • Data mining may generate thousands of patterns: Not all of them are interesting
      • Suggested approach: Human-centered, query-based, focused mining
  • Interestingness measures
      • A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
  • Objective vs. subjective interestingness measures
      • Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
      • Subjective: based on user’s belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Can We Find All and Only Interesting Patterns?
  • Find all the interesting patterns: Completeness
      • Can a data mining system find all the interesting patterns?
      • Heuristic vs. exhaustive search
      • Association vs. classification vs. clustering
  • Search for only interesting patterns: An optimization problem
      • Can a data mining system find only the interesting patterns?
      • Approaches
          • First generate all the patterns and then filter out the uninteresting ones
          • Generate only the interesting patterns: mining query optimization

Data Mining: Confluence of Multiple Disciplines

  [Figure labels: Data Mining; Database Systems; Statistics; Other Disciplines; Algorithm; Machine Learning; Visualization]

Data Mining: Classification Schemes
  • General functionality
      • Descriptive data mining
      • Predictive data mining
  • Different views, different classifications
      • Kinds of data to be mined
      • Kinds of knowledge to be discovered
      • Kinds of techniques utilized
      • Kinds of applications adapted

Multi-Dimensional View of Data Mining
  • Data to be mined
      • Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
  • Knowledge to be mined
      • Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
      • Multiple/integrated functions and mining at multiple levels
  • Techniques utilized
      • Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
  • Applications adapted
      • Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, Web mining, etc.

OLAP Mining: Integration of Data Mining and Data Warehousing
  • Data mining systems, DBMS, Data warehouse systems coupling
      • No coupling, loose-coupling, semi-tight-coupling, tight-coupling
  • On-line analytical mining data
      • Integration of mining and OLAP technologies
  • Interactive mining multi-level knowledge
      • Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
  • Integration of multiple mining functions
      • Characterized classification, first clustering and then association

An OLAM Architecture
  [Figure labels: Data Warehouse; Meta Data; MDDB; OLAM Engine; OLAP Engine; User GUI API; Data Cube API; Database API; Data cleaning; Data integration; Layer 3: OLAP/OLAM; Layer 2: MDDB; Layer 1: Data Repository; Layer 4: User Interface; Filtering & Integration; Filtering; Databases; Mining query; Mining result]

Major Issues in Data Mining
  • Mining methodology
      • Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
      • Performance: efficiency, effectiveness, and scalability
      • Pattern evaluation: the interestingness problem
      • Incorporation of background knowledge
      • Handling noise and incomplete data
      • Parallel, distributed and incremental mining methods
      • Integration of the discovered knowledge with existing one: knowledge fusion
  • User interaction
      • Data mining query languages and ad-hoc mining
      • Expression and visualization of data mining results
      • Interactive mining of knowledge at multiple levels of abstraction
  • Applications and social impacts
      • Domain-specific data mining & invisible data mining
      • Protection of data security, integrity, and privacy

Summary
  • Data mining: discovering interesting patterns from large amounts of data
  • A natural evolution of database technology, in great demand, with wide applications
  • A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
  • Mining can be performed in a variety of information repositories
  • Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
  • Data mining systems and architectures
  • Major issues in data mining

A Brief History of Data Mining Society
  • 1989 IJCAI Workshop on Knowledge Discovery in Databases (Piatetsky-Shapiro)
      • Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
  • 1991-1994 Workshops on Knowledge Discovery in Databases
      • Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
  • 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD’95-98)
      • Journal of Data Mining and Knowledge Discovery (1997)
  • 1998 ACM SIGKDD, SIGKDD’1999-2001 conferences, and SIGKDD Explorations
  • More conferences on data mining
      • PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.

Where to Find References?
  • Data mining and KDD (SIGKDD: CDROM)
      • Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
      • Journal: Data Mining and Knowledge Dis
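
Worked example: support and confidence

The Data Mining Functionalities slide annotates the association rule Diaper → Beer with the pair [0.5%, 75%], and the interestingness slide names support and confidence as objective measures. The pair is read as [support, confidence]. The Python sketch below shows one way to compute both from a list of transactions. The function names and the toy baskets are hypothetical and chosen only for illustration; they are not taken from the textbook or from any particular mining system.

```python
from typing import FrozenSet, Iterable, List


def support(transactions: List[FrozenSet[str]], itemset: Iterable[str]) -> float:
    """Fraction of all transactions that contain every item in `itemset`."""
    items = frozenset(itemset)
    if not transactions:
        return 0.0
    hits = sum(1 for t in transactions if items <= t)
    return hits / len(transactions)


def confidence(
    transactions: List[FrozenSet[str]],
    antecedent: Iterable[str],
    consequent: Iterable[str],
) -> float:
    """Among transactions containing the antecedent, the fraction that
    also contain the consequent."""
    lhs = frozenset(antecedent)
    both = lhs | frozenset(consequent)
    n_lhs = sum(1 for t in transactions if lhs <= t)
    if n_lhs == 0:
        return 0.0  # the rule never fires, so report zero confidence
    n_both = sum(1 for t in transactions if both <= t)
    return n_both / n_lhs


# Hypothetical toy data: five market baskets (an assumption for illustration).
baskets = [
    frozenset({"diaper", "beer", "milk"}),
    frozenset({"diaper", "beer"}),
    frozenset({"diaper", "bread"}),
    frozenset({"milk", "bread"}),
    frozenset({"diaper", "beer", "bread"}),
]

rule_support = support(baskets, {"diaper", "beer"})
rule_confidence = confidence(baskets, {"diaper"}, {"beer"})
print(f"Diaper -> Beer [{rule_support:.0%}, {rule_confidence:.0%}]")
# prints: Diaper -> Beer [60%, 75%]
```

In this toy data, three of the five baskets contain both items (support 60%), and three of the four baskets containing diapers also contain beer (confidence 75%). A system following the "generate all patterns, then filter" approach would apply minimum thresholds on these two values to discard uninteresting rules.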
