Data Mining:
Concepts and Techniques
— Chapter 2 —
Jiawei Han
Department of Computer Science, University of Illinois at Urbana-Champaign (/~hanj)
©2006 Jiawei Han and Micheline Kamber. All rights reserved.

Chapter 2: Data Preprocessing
- Why preprocess the data?
- Descriptive data summarization
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data (e.g., occupation = "")
  - noisy: containing errors or outliers (e.g., Salary = "-10")
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age = "42" but Birthday = "03/07/1997"
    - e.g., was rating "1, 2, 3", now rating "A, B, C"
    - e.g., discrepancy between duplicate records

Why Is Data Dirty?
- Incomplete data may come from
  - "Not applicable" data values when collected
  - Different considerations between the time the data was collected and the time it is analyzed
  - Human/hardware/software problems
- Noisy data (incorrect values) may come from
  - Faulty data collection instruments
  - Human or computer error at data entry
  - Errors in data transmission
- Inconsistent data may come from
  - Different data sources
  - Functional dependency violation (e.g., modifying some linked data)
- Duplicate records also need data cleaning

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data: e.g., duplicate or missing data may cause incorrect or even misleading statistics
  - A data warehouse needs consistent integration of quality data
- Data extraction, cleaning, and transformation comprise the majority of the work of building a data warehouse

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view: accuracy, completeness, consistency, timeliness, believability, value added, interpretability, accessibility
- Broad categories: intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
- Data cleaning: fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration: integration of multiple databases, data cubes, or files
- Data transformation: normalization and aggregation
- Data reduction: obtains a representation reduced in volume that produces the same or similar analytical results
- Data discretization: part of data reduction, but of particular importance, especially for numerical data

Forms of Data Preprocessing
(figure: forms of data preprocessing)
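The three kinds of dirty data above (incomplete, noisy, inconsistent) can already be caught by simple rule checks. A minimal sketch in Python; the records, field names, and cutoff date are hypothetical, invented only to mirror the slide's examples (occupation = "", Salary = "-10", Age vs. Birthday):

```python
# Hypothetical customer records illustrating the three kinds of dirty data.
from datetime import date

records = [
    {"name": "Ann", "occupation": "", "salary": -10, "age": 42,
     "birthday": date(1997, 3, 7)},
    {"name": "Bob", "occupation": "engineer", "salary": 52000, "age": 28,
     "birthday": date(1978, 5, 1)},
]

def audit(rec, today=date(2006, 1, 1)):
    """Return a list of dirty-data flags for one record."""
    flags = []
    if rec["occupation"] == "":                      # incomplete: missing value
        flags.append("incomplete:occupation")
    if rec["salary"] < 0:                            # noisy: impossible value
        flags.append("noisy:salary")
    implied_age = today.year - rec["birthday"].year  # inconsistent: age vs. birthday
    if abs(implied_age - rec["age"]) > 1:
        flags.append("inconsistent:age")
    return flags

print([audit(r) for r in records])
```

Real cleaning pipelines would drive such checks from metadata (domain, range, dependency rules), as the data-cleaning slides describe.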
Chapter 2: Data Preprocessing (section: Descriptive Data Summarization)

Mining Data Descriptive Characteristics
- Motivation: to better understand the data (central tendency, variation, and spread)
- Data dispersion characteristics: median, max, min, quantiles, outliers, variance, etc.
- Numerical dimensions correspond to sorted intervals
  - Data dispersion: analyzed with multiple granularities of precision
  - Boxplot or quantile analysis on sorted intervals
- Dispersion analysis on computed measures
  - Folding measures into numerical dimensions
  - Boxplot or quantile analysis on the transformed cube

Measuring the Central Tendency
- Mean (algebraic measure; sample vs. population): x̄ = (1/n) Σ x_i; μ = Σ x / N
  - Weighted arithmetic mean: x̄ = Σ w_i x_i / Σ w_i
  - Trimmed mean: chopping extreme values
- Median: a holistic measure
  - Middle value if there is an odd number of values; average of the middle two values otherwise
  - Estimated by interpolation for grouped data: median ≈ L1 + ((n/2 − (Σ freq)_l) / freq_median) × width
- Mode
  - Value that occurs most frequently in the data
  - Unimodal, bimodal, trimodal
  - Empirical formula (moderately skewed data): mean − mode ≈ 3 × (mean − median)

Symmetric vs. Skewed Data
(figure: median, mean and mode of symmetric, positively skewed, and negatively skewed data)

Measuring the Dispersion of Data
- Quartiles, outliers and boxplots
  - Quartiles: Q1 (25th percentile), Q3 (75th percentile)
  - Inter-quartile range: IQR = Q3 − Q1
  - Five-number summary: min, Q1, M, Q3, max
  - Boxplot: the ends of the box are the quartiles, the median is marked, whiskers extend outward, and outliers are plotted individually
  - Outlier: usually, a value more than 1.5 × IQR above Q3 or below Q1
- Variance and standard deviation (sample: s, population: σ)
  - Variance (algebraic, scalable computation): s² = (1/(n−1)) Σ (x_i − x̄)²; σ² = (1/N) Σ (x_i − μ)²
  - The standard deviation s (or σ) is the square root of the variance s² (or σ²)

Properties of the Normal Distribution Curve
- From μ−σ to μ+σ: contains about 68% of the measurements (μ: mean, σ: standard deviation)
- From μ−2σ to μ+2σ: contains about 95% of it
- From μ−3σ to μ+3σ: contains about 99.7% of it
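The central-tendency and dispersion measures above can be computed with Python's standard statistics module. A small sketch on a 12-value sample; note the "inclusive" quartile method is one of several conventions, so Q1/Q3 may differ slightly from other tools:

```python
import statistics as st

data = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]   # sorted sample

mean = st.mean(data)                   # central tendency
median = st.median(data)               # average of the two middle values here
mode = st.mode(data)                   # most frequent value

# Quartiles via linear interpolation; "inclusive" treats min/max as endpoints
q1, q2, q3 = st.quantiles(data, n=4, method="inclusive")
iqr = q3 - q1                          # inter-quartile range
five_number = (min(data), q1, median, q3, max(data))

var = st.variance(data)                # sample variance s^2 (n-1 denominator)
std = st.stdev(data)                   # s = sqrt(s^2)

# Boxplot outlier fences: beyond 1.5 x IQR from the quartiles
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]
print(five_number, iqr, outliers)
```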
Boxplot Analysis
- Five-number summary of a distribution: Minimum, Q1, M(edian), Q3, Maximum
- Boxplot
  - Data is represented with a box
  - The ends of the box are at the first and third quartiles, i.e., the height of the box is the IQR
  - The median is marked by a line within the box
  - Whiskers: two lines outside the box extend to Minimum and Maximum

Visualization of Data Dispersion: Boxplot Analysis
(figure: side-by-side boxplots)

Histogram Analysis
- Graph displays of basic statistical class descriptions
- Frequency histograms
  - A univariate graphical method
  - Consists of a set of rectangles that reflect the counts or frequencies of the classes present in the given data

Quantile Plot
- Displays all of the data (allowing the user to assess both the overall behavior and unusual occurrences)
- Plots quantile information: for a data value x_i (data sorted in increasing order), f_i indicates that approximately 100·f_i% of the data are below or equal to the value x_i

Quantile-Quantile (Q-Q) Plot
- Graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Allows the user to view whether there is a shift in going from one distribution to another

Scatter Plot
- Provides a first look at bivariate data to see clusters of points, outliers, etc.
- Each pair of values is treated as a pair of coordinates and plotted as points in the plane

Loess Curve
- Adds a smooth curve to a scatter plot in order to provide better perception of the pattern of dependence
- The loess curve is fitted by setting two parameters: a smoothing parameter, and the degree of the polynomials fitted by the regression

Positively and Negatively Correlated Data
(figure: scatter plots of positively and negatively correlated data)

Not Correlated Data
(figure: scatter plots showing no correlation)

Graphic Displays of Basic Statistical Descriptions
- Histogram (shown before)
- Boxplot (covered before)
- Quantile plot: each value x_i is paired with f_i indicating that approximately 100·f_i% of the data are ≤ x_i
- Quantile-quantile (q-q) plot: graphs the quantiles of one univariate distribution against the corresponding quantiles of another
- Scatter plot: each pair of values is a pair of coordinates, plotted as points in the plane
- Loess (local regression) curve: adds a smooth curve to a scatter plot to provide better perception of the pattern of dependence

Chapter 2: Data Preprocessing (section: Data Cleaning)

Data Cleaning
- Importance
  - "Data cleaning is one of the three biggest problems in data warehousing" — Ralph Kimball
  - "Data cleaning is the number one problem in data warehousing" — DCI survey
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data
- Data is not always available: e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - no registered history or changes of the data
- Missing data may need to be inferred

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Fill it in automatically with
  - a global constant, e.g., "unknown" (a new class?!)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, such as a Bayesian formula or a decision tree

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning: first sort the data and partition it into (equal-frequency) bins; then smooth by bin means, bin medians, bin boundaries, etc.
- Regression: smooth by fitting the data into regression functions
- Clustering: detect and remove outliers
- Combined computer and human inspection: detect suspicious values and check by a human (e.g., deal with possible outliers)

Simple Discretization Methods: Binning
- Equal-width (distance) partitioning
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the width of the intervals will be W = (B − A)/N
  - The most straightforward, but outliers may dominate the presentation; skewed data is not handled well
- Equal-depth (frequency) partitioning
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling; managing categorical attributes can be tricky

Binning Methods for Data Smoothing
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into equal-frequency (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

Regression
(figure: data smoothed by fitting the line y = x + 1; a point X1 maps to Y1' on the line)

Cluster Analysis
(figure: clusters of points, with outliers falling outside the clusters)

Data Cleaning as a Process
- Data discrepancy detection
  - Use metadata (e.g., domain, range, dependency, distribution)
  - Check field overloading
  - Check uniqueness rule, consecutive rule, and null rule
  - Use commercial tools
    - Data scrubbing: use simple domain knowledge (e.g., postal codes, spell-check) to detect errors and make corrections
    - Data auditing: analyze the data to discover rules and relationships and detect violators (e.g., correlation and clustering to find outliers)
- Data migration and integration
  - Data migration tools: allow transformations to be specified
  - ETL (Extraction/Transformation/Loading) tools: allow users to specify transformations through a graphical user interface
- Integration of the two processes
  - Iterative and interactive (e.g., Potter's Wheel)
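The price example from the binning slide above can be reproduced in a few lines. A sketch assuming three equi-depth bins and rounding bin means to integers, which matches the slide's smoothed values:

```python
# Equal-frequency binning and smoothing, reproducing the slide's price data.
prices = sorted([4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34])
n_bins = 3
depth = len(prices) // n_bins                       # equi-depth bin size
bins = [prices[i:i + depth] for i in range(0, len(prices), depth)]

# Smoothing by bin means: every value replaced by its bin's (rounded) mean
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: every value snapped to the nearer bin end
def snap(v, lo, hi):
    return lo if v - lo <= hi - v else hi

by_bounds = [[snap(v, b[0], b[-1]) for v in b] for b in bins]

print(by_means)
print(by_bounds)
```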
Chapter 2: Data Preprocessing (section: Data Integration and Transformation)

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration: e.g., A.cust-id ≡ B.cust-#; integrate metadata from different sources
- Entity identification problem: identify real-world entities from multiple data sources, e.g., Bill Clinton = William Clinton
- Detecting and resolving data value conflicts
  - For the same real-world entity, attribute values from different sources differ
  - Possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
- Redundant data occur often when integrating multiple databases
  - Object identification: the same attribute or object may have different names in different databases
  - Derivable data: one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant attributes may be detected by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Correlation Analysis (Numerical Data)
- Correlation coefficient (also called Pearson's product-moment coefficient):
  r_{A,B} = Σ (a_i − Ā)(b_i − B̄) / (n σ_A σ_B) = (Σ a_i b_i − n Ā B̄) / (n σ_A σ_B)
  where n is the number of tuples, Ā and B̄ are the respective means of A and B, σ_A and σ_B are the respective standard deviations of A and B, and Σ a_i b_i is the sum of the AB cross-product
- If r_{A,B} > 0, A and B are positively correlated (A's values increase as B's do); the higher the value, the stronger the correlation
- r_{A,B} = 0: independent; r_{A,B} < 0: negatively correlated

Correlation Analysis (Categorical Data)
- χ² (chi-square) test: χ² = Σ (observed − expected)² / expected
- The larger the χ² value, the more likely the variables are related
- The cells that contribute the most to the χ² value are those whose actual count is very different from the expected count
- Correlation does not imply causality
  - The number of hospitals and the number of car thefts in a city are correlated
  - Both are causally linked to a third variable: population

Chi-Square Calculation: An Example
(numbers in parentheses are expected counts, calculated from the data distribution in the two categories)

                          | Play chess | Not play chess | Sum (row)
Like science fiction      | 250 (90)   | 200 (360)      | 450
Not like science fiction  | 50 (210)   | 1000 (840)     | 1050
Sum (col.)                | 300        | 1200           | 1500

χ² = (250−90)²/90 + (50−210)²/210 + (200−360)²/360 + (1000−840)²/840 ≈ 507.93

- It shows that like_science_fiction and play_chess are correlated in the group

Data Transformation
- Smoothing: remove noise from data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scaled to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction: new attributes constructed from the given ones

Data Transformation: Normalization
- Min-max normalization to [new_min_A, new_max_A]:
  v' = (v − min_A)/(max_A − min_A) × (new_max_A − new_min_A) + new_min_A
  Ex.: let income range $12,000 to $98,000 be normalized to [0.0, 1.0]; then $73,000 is mapped to (73,000 − 12,000)/(98,000 − 12,000) ≈ 0.709
- Z-score normalization (μ: mean, σ: standard deviation): v' = (v − μ)/σ
  Ex.: let μ = 54,000 and σ = 16,000; then $73,000 maps to (73,000 − 54,000)/16,000 ≈ 1.19
- Normalization by decimal scaling: v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1

Chapter 2: Data Preprocessing (section: Data Reduction)

Data Reduction Strategies
- Why data reduction?
  - A database/data warehouse may store terabytes of data
  - Complex data analysis/mining may take a very long time to run on the complete data set
- Data reduction: obtain a reduced representation of the data set that is much smaller in volume, yet produces the same (or almost the same) analytical results
- Data reduction strategies
  - Data cube aggregation
  - Dimensionality reduction (e.g., remove unimportant attributes)
  - Data compression
  - Numerosity reduction (e.g., fit data into models)
  - Discretization and concept hierarchy generation

Data Cube Aggregation
- The lowest level of a data cube (base cuboid): the aggregated data for an individual entity of interest, e.g., a customer in a phone-calling data warehouse
- Multiple levels of aggregation in data cubes: further reduce the size of the data to deal with
- Reference appropriate levels: use the smallest representation which is enough to solve the task
- Queries regarding aggregated information should be answered using the data cube, when possible

Attribute Subset Selection
- Feature selection (i.e., attribute subset selection)
  - Select a minimum set of features such that the probability distribution of the different classes given the values for those features is as close as possible to the original distribution given the values of all features
  - Reduces the number of patterns in the result, which are then easier to understand
- Heuristic methods (due to the exponential number of choices):
  - Step-wise forward selection
  - Step-wise backward elimination
  - Combining forward selection and backward elimination
  - Decision-tree induction

Example of Decision Tree Induction
- Initial attribute set: {A1, A2, A3, A4, A5, A6}
(figure: a decision tree branching on A4, then A1 and A6, with Class 1 / Class 2 leaves)
- Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods:
  - Best single features under the feature-independence assumption: choose by significance tests
  - Best step-wise feature selection: the best single feature is picked first, then the next best feature conditioned on the first, ...
  - Step-wise feature elimination: repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound: use feature elimination and backtracking

Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences (not audio): typically short, and vary slowly with time

Data Compression
(figure: lossless compression recovers the original data exactly; lossy compression recovers only an approximation)

Dimensionality Reduction: Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multi-resolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method
  - The length L must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - They apply to pairs of data, resulting in two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached
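The smoothing/difference recursion just described can be sketched for the Haar case: pairwise averages form the smooth part and pairwise half-differences the detail coefficients. Normalization conventions vary across implementations; this sketch uses plain averages, which reproduces a common worked example for the signal (2, 2, 0, 2, 3, 5, 4, 4):

```python
def haar(signal):
    """Full Haar wavelet transform of a length-2^k signal: repeatedly apply
    the smoothing (pairwise average) and difference (pairwise half-difference)
    functions to the smooth part, halving its length each pass."""
    n = len(signal)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    out = list(signal)
    length = n
    while length > 1:
        smooth = [(out[i] + out[i + 1]) / 2 for i in range(0, length, 2)]
        detail = [(out[i] - out[i + 1]) / 2 for i in range(0, length, 2)]
        out[:length] = smooth + detail   # coarse part first, then coefficients
        length //= 2
    return out

coeffs = haar([2, 2, 0, 2, 3, 5, 4, 4])
print(coeffs)   # overall average first, then detail coefficients
```

Compression then amounts to keeping only the largest-magnitude coefficients and zeroing the rest.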
(figure: Haar-2 and Daubechies-4 wavelet functions)

DWT for Image Compression
(figure: an image passed repeatedly through low-pass/high-pass filter banks)

Dimensionality Reduction: Principal Component Analysis (PCA)
- Given N data vectors from n dimensions, find k ≤ n orthogonal vectors (principal components) that can best be used to represent the data
- Steps
  - Normalize the input data: each attribute falls within the same range
  - Compute k orthonormal (unit) vectors, i.e., the principal components
  - Each input data vector is a linear combination of the k principal component vectors
  - The principal components are sorted in order of decreasing "significance" or strength
  - Since the components are sorted, the size of the data can be reduced by eliminating the weak components, i.e., those with low variance (using the strongest principal components, it is possible to reconstruct a good approximation of the original data)
- Works for numeric data only
- Used when the number of dimensions is large

Principal Component Analysis
(figure: data in the X1-X2 plane with principal directions Y1 and Y2)

Numerosity Reduction
- Reduce data volume by choosing alternative, smaller forms of data representation
- Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Example: log-linear models, which obtain a value at a point in m-D space as the product on appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Data Reduction Method (1): Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line; often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = wX + b
  - The two regression coefficients, w and b, specify the line and are estimated using the data at hand
  - Apply the least-squares criterion to the known values of Y1, Y2, …, X1, X2, …
- Multiple regression: Y = b0 + b1·X1 + b2·X2
  - Many nonlinear functions can be transformed into the above
- Log-linear models
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd

Data Reduction Method (2): Histograms
- Divide data into buckets and store the average (sum) for each bucket
- Partitioning rules:
  - Equal-width: equal bucket range
  - Equal-frequency (or equal-depth)
  - V-optimal: the least histogram variance (weighted sum of the original values that each bucket represents)
  - MaxDiff: set bucket boundaries between the pairs of adjacent values having the β−1 largest differences

Data Reduction Method (3): Clustering
- Partition the data set into clusters based on similarity, and store only the cluster representation (e.g., centroid and diameter)
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can have hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms
- Cluster analysis will be studied in depth in Chapter 7

Data Reduction Method (4): Sampling
- Sampling: obtaining a small sample s to represent the whole data set N
- Allows a mining algorithm to run in complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data: simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods, e.g., stratified sampling:
  - Approximate the percentage of each class (or subpopulation of interest) in the overall database
  - Used in conjunction with skewed data
- Note: sampling may not reduce database I/Os (a page at a time)

Sampling: With or Without Replacement
(figure: SRSWOR (simple random sample without replacement) and SRSWR drawn from the raw data)

Sampling: Cluster or Stratified Sampling
(figure: raw data vs. cluster/stratified sample)

Chapter 2: Data Preprocessing (section: Discretization and Concept Hierarchy Generation)

Discretization
- Three types of attributes
  - Nominal: values from an unordered set, e.g., color, profession
  - Ordinal: values from an ordered set, e.g., military or academic rank
  - Continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and Concept Hierarchy
- Discretization
  - Reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals
  - Interval labels can then be used to replace actual data values
  - Supervised vs. unsupervised
  - Split (top-down) vs. merge (bottom-up)
  - Discretization can be performed recursively on an attribute
- Concept hierarchy formation
  - Recursively reduce the data by collecting and replacing low-level concepts (such as numeric values for age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data
- Typical methods (all can be applied recursively):
  - Binning (covered above): top-down split, unsupervised
  - Histogram analysis (covered above): top-down split, unsupervised
  - Clustering analysis (covered above): either top-down split or bottom-up merge, unsupervised
  - Entropy-based discretization: supervised, top-down split
  - Interval merging by χ² analysis: supervised, bottom-up merge
  - Segmentation by natural partitioning: top-down split, unsupervised

Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the expected information after partitioning is
  I(S, T) = (|S1|/|S|)·Entropy(S1) + (|S2|/|S|)·Entropy(S2)
- Entropy is calculated from the class distribution of the samples in the set: given m classes, the entropy of S1 is
  Entropy(S1) = −Σ_{i=1..m} p_i log2(p_i)
  where p_i is the probability of class i in S1
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is recursively applied to the partitions obtained until some stopping criterion is met
- Such a boundary may reduce data size and improve classification accuracy

Interval Merge by χ² Analysis
- Merging-based (bottom-up) vs. splitting-based methods
- Merge: find the best neighboring intervals and merge them to form larger intervals, recursively
- ChiMerge [Kerber, AAAI 1992; see also Liu et al., DMKD 2002]
  - Initially, each distinct value of a numerical attribute A is considered to be one interval
  - χ² tests are performed for every pair of adjacent intervals
  - Adjacent intervals with the least χ² values are merged together, since low χ² values for a pair indicate similar class distributions
  - This merge process proceeds recursively until a predefined stopping criterion is met (such as significance level, max-interval, max inconsistency, etc.)

Segmentation by Natural Partitioning
- A simple 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  - If an interval covers 3, 6, 7 or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  - If it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - If it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals

Example of 3-4-5 Rule
- Step 1: profit ranges from Min = −$351 to Max = $4,700; the 5th and 95th percentiles are Low = −$159 and High = $1,838; msd = 1,000, so the range is rounded to Low' = −$1,000 and High' = $2,000
- Step 2: (2,000 − (−1,000)) / 1,000 = 3 distinct values at the msd, so partition into 3 equi-width intervals: (−$1,000, $0], ($0, $1,000], ($1,000, $2,000]
- Step 3: adjust the boundary intervals to the actual Min and Max: since Min = −$351 > −$1,000, shrink the first interval to (−$400, $0]; since Max = $4,700 > $2,000, add the interval ($2,000, $5,000], giving (−$400, $5,000] overall
- Step 4: apply the rule recursively to each interval: (−$400, $0] into 4 sub-intervals (−$400, −$300], (−$300, −$200], (−$200, −$100], (−$100, $0]; ($0, $1,000] into 5 sub-intervals of width $200; ($1,000, $2,000] into 5 sub-intervals of width $200; ($2,000, $5,000] into 3 sub-intervals of width $1,000
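The entropy-based discretization described earlier in this section amounts to an exhaustive search for the boundary T that minimizes the weighted entropy I(S, T). A minimal sketch; the values and class labels below are made up purely for illustration:

```python
# Entropy-based binary discretization: try every candidate boundary between
# adjacent distinct values and keep the one minimizing the weighted entropy.
from math import log2
from collections import Counter

def entropy(labels):
    """Class entropy: -sum p_i * log2(p_i) over the class distribution."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_split(values, labels):
    """Return (T, I(S,T)) minimizing
    I(S,T) = |S1|/|S| * Entropy(S1) + |S2|/|S| * Entropy(S2)."""
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    best = None
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                       # no boundary between equal values
        t = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        info = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
        if best is None or info < best[1]:
            best = (t, info)
    return best

# Toy data: the class flips from "low" to "high" between 15 and 21
vals = [4, 8, 9, 15, 21, 24, 26, 29]
labs = ["low", "low", "low", "low", "high", "high", "high", "high"]
t, info = best_split(vals, labs)
print(t, info)   # boundary 18.0 separates the classes perfectly, I(S,T) = 0
```

Applied recursively to each resulting partition (with a stopping criterion such as minimum information gain), this yields the supervised, top-down discretization the slides describe.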