Original Paper
Debbie Rankin (1), PhD, corresponding author, d.rankin1@ulster.ac.uk, +44 28 7167 5841
Michaela Black (1), PhD, mm.black@ulster.ac.uk
Raymond Bond (2), PhD, rb.bond@ulster.ac.uk
Jonathan Wallace (2), MSc, jg.wallace@ulster.ac.uk
Maurice Mulvenna (2), PhD, md.mulvenna@ulster.ac.uk
Gorka Epelde (3,4), PhD, gepelde@
(1) School of Computing, Engineering and Intelligent Systems, Ulster University, Derry~Londonderry, Northern Ireland, United Kingdom
(2) School of Computing, Ulster University, Jordanstown, Northern Ireland, United Kingdom
(3) Vicomtech Foundation, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastián, Spain
(4) Biodonostia Health Research Institute, eHealth Group, Donostia-San Sebastián, Spain
Reliability of Supervised Machine Learning Using Synthetic Data in Healthcare: A Model to Preserve Privacy for Data Sharing
Abstract
Background:
The exploitation of synthetic data in healthcare is at an early stage. Synthetic data generation could unlock the vast potential within healthcare datasets that are too sensitive for release due to privacy concerns. Several synthetic data generators have been developed to date; however, studies evaluating their efficacy and generalisability are scarce.
Objective:
This work sets out to understand the difference in performance of supervised machine learning models trained on synthetic data compared with those trained on real data.
Methods:
A total of 19 open healthcare datasets containing both categorical and numerical data have been selected for experimental work. Synthetic data is generated using three popular synthetic data generators that apply Classification and Regression Trees, parametric and Bayesian network approaches. Real and synthetic data are used (separately) to train five supervised machine learning models: stochastic gradient descent, decision tree, k-nearest neighbors, random forest and support vector machine. Models are tested only on real data to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples. Evaluation metrics are computed and differentials in these scores are compared. The impact of statistical disclosure control on model performance is also assessed.
Results:
The accuracy of ML models trained on synthetic data is lower than that of models trained on real data in 92% of cases. Tree-based models trained on synthetic data have deviations in accuracy from models trained on real data of 17.7-19.3%, whilst other models have lower deviations of 5.8-7.2%. The winning classifier when trained and tested on real data versus models trained on synthetic data and tested on real data is the same in 26.3% of cases for CART and parametric synthetic data, and in 21.1% of cases for Bayesian network generated synthetic data. Tree-based models perform best with real data and are the winning classifier in 94.7% of cases. This is not the case for models trained on synthetic data. When tree-based models are not considered, the winning classifier for real and synthetic data is matched in 73.7%, 52.6% and 68.4% of cases for CART, parametric and Bayesian network synthetic data, respectively. Statistical disclosure control methods did not have a notable impact on data utility.
Conclusions:
The results of this study are promising, with small decreases in accuracy observed in models trained with synthetic data compared to models trained with real data, where both are tested on real data. Such deviations are expected and manageable. Tree-based classifiers have some sensitivity to synthetic data and the underlying cause requires further investigation. This study highlights the potential of synthetic data and the need for further evaluation of its robustness. Synthetic data must ensure that individual privacy and data utility are preserved in order to instil confidence in healthcare departments when utilising such data to inform policy decision-making.
Keywords: Synthetic Data; Supervised Machine Learning; Data Utility; Healthcare; Decision Support; Statistical Disclosure Control
Introduction
Background
National Healthcare Departments hold vast volumes of data on patients and the population that is not being used to its full potential due to valid privacy concerns. Machine learning (ML) has the potential to vastly improve decisions and outcomes in healthcare, and yet these improvements have not been fully realised. The reason may be in part related to an issue that faces many data scientists and researchers in the area: the limited availability of or access to data, or the readiness of healthcare institutions to share data. Privacy concerns over personal data, and in particular healthcare data, mean that although the data exists, it is deemed too sensitive for public release [1], even in the case of serious research.
One way to overcome the issue of data availability is to use fully synthetic data as an alternative to real data. The exploitation of synthetic data in healthcare is at an early stage and is gaining increasing attention. Synthetic data is data that is simulated from real data by using the underlying statistical properties of the real data to produce synthetic datasets that exhibit these same statistical properties. Synthetic data can represent the population in the original data whilst avoiding any divulgence of real, potentially personal, confidential and sensitive data. In the case of health-related data, this would ensure that actual patient records are not disclosed, thus avoiding governance and confidentiality issues. There are three types of synthetic data: fully synthetic, partially synthetic, and hybrid synthetic. This work considers fully synthetic data, which does not contain original data.
Synthetic data can be used in two ways: to augment an existing dataset, thus increasing its size, for times when a dataset is unbalanced due to the limited occurrence of an event or when more examples are required [2,3]; and to generate a fully synthetic dataset that is representative of the original dataset, for times when data is not available due to its sensitive nature [4]. The latter is considered in this work as a key requirement for healthcare data sharing.
Traditionally, data perturbation techniques such as data swapping, data masking, cell suppression and adding noise have been applied to real data to modify and thus protect the data from disclosure prior to releasing it. However, such methods do not eliminate disclosure risk and can impact the utility of the data, particularly if multivariate relationships are not considered [5]. Synthetic data was first proposed by Rubin [6] and Little [7]. Raghunathan, Reiter and Rubin [8] implemented and extended this, pioneering the multiple imputation approach to synthetic data generation, exemplified in a range of studies [9-14]. Reiter [15] then introduced an alternative method of synthesising data through a non-parametric tree-based technique that utilises Classification and Regression Trees (CART). A more recent technique proposes a Bayesian network approach for synthetic data generation [16]. Synthetic data is considered a secure approach for enabling public release of sensitive data as it goes beyond traditional de-identification methods by generating a fake dataset that does not contain any of the original, identifiable information from which it was generated, whilst retaining the valid statistical properties of the real data. Therefore, disclosure of a real person or reverse engineering is considered to be unlikely [17].
Whilst a number of synthetic data generators have been developed, empirical evidence of their efficacy has not been fully explored. This work extends a preliminary study [18] and investigates whether fully synthetic data can preserve the hidden complex patterns that supervised ML can uncover from real data, and therefore whether it can be used as a valid alternative to real data when developing eHealth applications and healthcare policy-making solutions. This will be achieved by experimenting with a range of open healthcare datasets. Synthetic data will be generated using three well-known synthetic data generation techniques. Supervised ML algorithms will be used to validate the performance of the synthetic datasets. Statistical disclosure control (SDC) methods that can further decrease the disclosure risk associated with synthetic data will also be considered.
Overview
To inform the viability of the use of synthetic data as a valid and reliable alternative to real data in the healthcare domain, we will answer the following research questions:
What is the differential in performance when using synthetic data versus real data for training and testing supervised ML models?
What is the variance of the absolute difference in accuracy between ML models trained on real and synthetic datasets?
How often does the winning ML technique change when moving from training on real data to training on synthetic data?
What is the impact of statistical disclosure control (i.e. privacy protection) measures on the utility of synthetic data (i.e. similarity to real data)?
To answer these questions, 19 open healthcare datasets containing both categorical and numerical data have been selected for experimentation [19]. Synthetic datasets are generated for each of these 19 datasets using three popular synthetic data generators that apply CART [15,17], parametric [8,17] and Bayesian network [16] approaches, respectively, to enable a robust comparison of the three synthetic data generation techniques across a broad range of data.
Initially we analyse whether the multivariate relationships that exist in the real data are preserved in the synthetic versions of the data, for data generated using each of the three synthetic data generation techniques, by computing pairwise mutual information scores for each variable pair combination in each dataset [16]. It is important that such relationships are retained when data is synthesised.
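The pairwise mutual information comparison can be sketched as follows. This is a minimal illustration and not the authors' code: it assumes numerical attributes are discretised into equal-width bins (the bin count of 10 is an assumption), uses scikit-learn's `normalized_mutual_info_score`, and stands in a bootstrap resample for the synthetic table purely so that the example runs. The function name `pairwise_mutual_information` and the toy columns are hypothetical.

```python
from itertools import combinations

import numpy as np
import pandas as pd
from sklearn.metrics import normalized_mutual_info_score


def pairwise_mutual_information(df, bins=10):
    """Return a symmetric table of normalised mutual information scores."""
    # Treat every attribute as categorical: bin numerical columns,
    # integer-code the rest. Assumes no missing values.
    coded = pd.DataFrame(index=df.index)
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            coded[col] = pd.cut(df[col], bins=bins, labels=False)
        else:
            coded[col] = df[col].astype("category").cat.codes

    scores = pd.DataFrame(
        np.eye(len(df.columns)), index=df.columns, columns=df.columns
    )
    for a, b in combinations(df.columns, 2):
        nmi = normalized_mutual_info_score(coded[a], coded[b])
        scores.loc[a, b] = scores.loc[b, a] = nmi
    return scores


# Toy "real" data: risk depends on age, noise is unrelated to both.
rng = np.random.default_rng(0)
real = pd.DataFrame({"age": rng.integers(20, 80, 200)})
real["risk"] = np.where(real["age"] > 50, "high", "low")
real["noise"] = rng.normal(size=200)

# Placeholder for a synthetic table (a bootstrap resample, NOT a real generator).
synthetic = real.sample(frac=1.0, replace=True, random_state=0).reset_index(drop=True)

mi_real = pairwise_mutual_information(real)
mi_synth = pairwise_mutual_information(synthetic)

# Largest absolute change in any pairwise score between the two tables.
max_gap = (mi_real - mi_synth).abs().to_numpy().max()
```

A small `max_gap` would suggest that the pairwise dependence structure of the real table survives in the synthetic one; plotting `mi_real` and `mi_synth` as side-by-side heat maps is a common way to inspect the same information.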
To evaluate the utility of synthetic data for machine learning, we then investigate the performance of supervised ML models trained on synthetic data and tested on real data, compared with models trained on real data and also tested on the real data. This allows us to determine if a model developed using synthetic data can classify real data examples as accurately and reliably as a model developed using real data. We consider five different supervised machine learning models to compare performance and determine if there are differences in robustness across each of these models. Standard evaluation metrics are computed for models trained on real and synthetic data, for each ML model, and for each dataset [20]. The differences in accuracy for models trained on synthetic data versus models trained on real data are computed to analyse the extent to which synthetic data causes a degradation in model performance, if any.
It is pertinent that the optimal ML model built using synthetic data matches the optimal ML model that would be selected if real data were used in the model training process. This would provide stakeholders in healthcare with confidence in the use of synthetic data for model development. Thus, we consider how often the best ML classifier built using synthetic data matches the best ML model built using real data.
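The "winning classifier" comparison amounts to taking the best-scoring model per dataset under each training regime and counting agreements. The sketch below assumes accuracies are held in nested dictionaries; the helper names and all numbers are hypothetical and are not results from this study.

```python
def winning_classifier(accuracies):
    """Return the name of the classifier with the highest accuracy.

    Ties go to the first classifier listed.
    """
    return max(accuracies, key=accuracies.get)


def winner_match_rate(real_results, synthetic_results):
    """Fraction of datasets where both training regimes pick the same winner."""
    matches = [
        winning_classifier(real_results[name])
        == winning_classifier(synthetic_results[name])
        for name in real_results
    ]
    return sum(matches) / len(matches)


# Hypothetical accuracies (all models tested on real data), illustration only.
real_results = {
    "dataset_1": {"SGD": 0.95, "DT": 0.97, "KNN": 0.96},
    "dataset_2": {"SGD": 0.70, "DT": 0.74, "KNN": 0.72},
}
synthetic_results = {
    "dataset_1": {"SGD": 0.94, "DT": 0.96, "KNN": 0.95},
    "dataset_2": {"SGD": 0.69, "DT": 0.60, "KNN": 0.71},
}

match_rate = winner_match_rate(real_results, synthetic_results)
```

In this toy case the decision tree wins on both datasets when trained on real data but only on the first when trained on synthetic data, so the winners agree on half of the datasets.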
Finally, the impact of a number of statistical disclosure control methods on model performance is assessed. Statistical disclosure control methods seek to further enhance data privacy; however, this can lead to a loss in usefulness of the data [21] and we consider the extent to which performance degradation occurs as a result of SDC.
This large-scale assessment of the reliability of synthetic data when used for supervised ML, utilising 19 healthcare datasets and 3 synthetic data generation techniques, provides an important contribution in relation to the trust and confidence that stakeholders in healthcare can have in synthetic data. We also propose a pipeline to illustrate how synthetic data can potentially fit within the healthcare provider context. This work demonstrates the promising performance of synthetic data whilst highlighting its limitations and future work directions to overcome them.
Synthetic Data: Present and Future Use
The validity and disclosure risk associated with synthetic data have been under investigation by the U.S. Census Bureau since 2003 for the purpose of creating public use data from a combination of sensitive data from the Census Bureau’s Survey of Income and Program Participation (SIPP), the Internal Revenue Service’s (IRS) individual lifetime earnings data, and the Social Security Administration’s (SSA) individual benefit data [22,23]. The goal was to enable the release of synthesised person-level records containing personal and financial characteristics from confidential datasets, whilst preserving privacy. Successful results have led to the release of public use synthetic data files. Researchers can have their work validated against the Gold Standard (real) data by the Census Bureau, thus enabling them to determine the impact of synthetic data on their exploratory analyses and model development and have confidence in their results, whilst also allowing the Census Bureau to continuously improve their synthesis techniques. The public release of this data has provided significant benefit to the research community and general population, enabling more extensive economic policy research to be performed by groups who could not previously access useful data [24-29]. This work led to the release of further synthetic datasets by the Census Bureau. The Synthetic Longitudinal Business Database (SynLBD) comprises data from an annual economic census of establishments in the U.S. [30]. This dataset provides broad access to rich data that supports the research and policy-making communities in business and employment related topics. OnTheMap is a tool utilising synthetic data to provide workforce related maps, demographic profiles and reports of U.S. citizens, as well as disaster event information and the impact of such events on workers and employers [31]. Similarly, synthetic data has also been under investigation in the UK as a means to provide public access to rich data from UK Longitudinal Studies [32-34] that contain highly sensitive data linking national census data to administrative data for individuals and their families.
These datasets enable researchers to explore data and to develop and test code and models with no restrictions, outside the secure environment where real data resides. The data owners in turn provide a validation mechanism whereby results, code and models can be validated on the real data within the secure environment on behalf of researchers, and feedback provided. This process increases research productivity whilst ensuring the development of robust and valid models [35].
Whilst synthetic data has been used to accelerate and democratise business and economic policy research [22-35], it is not currently in use for healthcare research, an area that could benefit enormously. With advancements in technology, particularly ML and artificial intelligence (AI), the potential to develop diagnostic tools for clinicians and data-driven decision-making platforms for health policy-makers is ever-increasing [36,37]. Such tools require access to healthcare data, for example, to train AI algorithms and produce models that can identify health conditions and health-related patterns across the population. Currently it can take a lengthy period of time for researchers to gain access to healthcare data, a rich and under-utilised resource, due to privacy concerns [38-42]. For example, in the case of the 40-month MIDAS Project [36,43], which developed a data-driven decision-making tool for healthcare policy-makers, it took more than 20 months to obtain access to the required data due to legal and ethical constraints. In addition, a number of important data variables could not be made available, which restricted the utility of the platform under development. With the help of synthetic data, such data, with more or all variables included, could have been made available in a matter of weeks, thus providing more time for development and evaluation of the platform. The platform could then have been installed in healthcare sites more quickly and connected to real data for validation and comparison of performance for synthetic versus real data, enabling performance tweaks to mitigate bias introduced by synthetic data, if any. Synthetic data could also enable cross-site analytics across various health regions, which would enable policy-makers to connect their health spaces and potentially provide significant enhancements to cross-national health policy.
The ultimate goal of this work is to further assess the validity and disclosure risk of synthetic data under the stringent conditions associated with healthcare data, with a view to developing a pipeline for use in healthcare. Such a pipeline would enable synthetic datasets to be released publicly to researchers who would otherwise not be able to access the data, or access it in a timely fashion, accelerating research by enabling the wider research community to use the data for analysis and model development. The results of such analyses and the models and code developed can then be given to healthcare departments for validation on the real data and, if effective, can be put into use by clinicians and health policy-makers.
Synthetic Data Pipeline for Healthcare
To understand how healthcare departments can benefit from synthetic data, we propose the pipeline shown in Figure 1. It is provided as an illustration of how synthetic data can potentially work within a real healthcare setting to expedite data analytics. In future work, we plan to test this pipeline in a real setting. In this pipeline, real data resides within the National Healthcare Department infrastructure. The data cannot be shared externally due to its sensitive and private nature. Healthcare departments may only have a small number of data science staff with the expertise necessary to apply ML techniques to many of their datasets, and so, due to lack of resources, they can neither maximise the use of their data nor discover its uses. By applying a synthetic data generation technique to the real data along with SDC measures, a synthetic dataset can be produced and made available to the external research community in place of the real data. External researchers, in large numbers and with wide-ranging expertise, can potentially develop optimal ML models trained on the synthetic data and share the performance of the ML model, the model itself and the model specification with the National Healthcare Department. The healthcare department can then test the ML model on real data, or in-house technical staff can rebuild the model according to the specification provided by researchers. The specification can include the program code written by researchers, details of the ML algorithm to use (e.g. decision tree, support vector machine), and the optimal hyperparameter settings determined during development. Using these settings, the model can then be rebuilt, this time by training on the real data, to which in-house staff have access, instead of synthetic data.
Figure 1. Proposed synthetic data sharing pipeline to illustrate how synthetic data could be implemented to expedite healthcare data analytics.
Methods
Dataset Selection
For experimentation, 19 open healthcare datasets have been selected from the UCI Machine Learning Repository [19]. Missing values have been removed from the datasets either by removing features with a high number of missing values or removing observations where a feature contains a missing value. The experimental datasets and their properties are summarised in Table 1. These datasets were selected to enable an analysis of synthetic data performance when applied to datasets of differing volume and data types (categorical and numerical).
Table 1. Summary of experimental datasets. [a]

| ID | Dataset | No. of Attributes | No. of Categorical Attributes | No. of Numerical Attributes | No. of Classes/Labels | No. of Observations |
|---|---|---|---|---|---|---|
| A | Breast Cancer Wisconsin (Original) | 9 | 0 | 9 | 2 | 683 |
| B | Breast Cancer | 9 | 9 | 0 | 2 | 277 |
| C | Breast Cancer Coimbra | 9 | 0 | 9 | 2 | 116 |
| D | Breast Tissue | 9 | 0 | 9 | 6 | 106 |
| E | Chronic Kidney Disease | 21 | 12 | 9 | 2 | 209 |
| F | Cardiotocography (3 Class) | 21 | 0 | 21 | 3 | 2126 |
| G | Cardiotocography (10 Class) | 21 | 0 | 21 | 10 | 2126 |
| H | Dermatology | 34 | 33 | 1 | 6 | 358 |
| I | Diabetic Retinopathy | 19 | 3 | 16 | 2 | 1151 |
| J | Echocardiogram | 10 | 2 | 8 | 3 | 106 |
| K | EEG Eye State | 14 | 0 | 14 | 2 | 14980 |
| L | Heart Disease | 13 | 8 | 5 | 2 | 303 |
| M | Lymphography | 18 | 18 | 0 | 4 | 148 |
| N | Post-Operative Patient Data | 8 | 8 | 0 | 3 | 87 |
| O | Primary Tumor | 15 | 15 | 0 | 21 | 336 |
| P | Stroke | 10 | 7 | 3 | 2 | 29072 |
| Q | Thoracic Surgery | 16 | 13 | 3 | 2 | 470 |
| R | Thyroid Disease | 22 | 16 | 6 | 28 | 5786 |
| S | Thyroid Disease (New) | 5 | 0 | 5 | 3 | 215 |
| Total | | 283 | 144 | 139 | 105 | 58,655 |

[a] Each dataset has been encoded with a letter (column 1) and will be referenced using this letter for the remainder of the paper.
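The missing-value handling described under Dataset Selection (dropping sparse features, then dropping incomplete observations) could look like the pandas sketch below. The paper does not state what counts as a "high number" of missing values, so the `max_missing_fraction` threshold of 0.5, the helper name `remove_missing` and the toy table are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def remove_missing(df, max_missing_fraction=0.5):
    """Drop sparse features first, then any rows that still contain gaps."""
    # Fraction of missing entries per feature.
    missing_fraction = df.isna().mean()
    sparse_features = missing_fraction[missing_fraction > max_missing_fraction].index
    return (
        df.drop(columns=sparse_features)
        .dropna(axis=0)
        .reset_index(drop=True)
    )


# Toy table: "c" is mostly missing, "a" has a single gap.
toy = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": ["x", "y", "x", "y"],
    "c": [np.nan, np.nan, np.nan, 1.0],
})
cleaned = remove_missing(toy)
```

Dropping sparse features before incomplete rows keeps more observations than the reverse order would, since a mostly empty feature would otherwise eliminate most rows.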
Generating Synthetic Data
In this work, we analyse and assess the performance of three publicly available synthetic data generation techniques that are based on well-known, seminal work in the area [6-10,15,16]. These methods are a parametric data synthesis technique, a non-parametric tree-based synthesis technique that utilises CART [15], and a synthesis technique that utilises Bayesian networks [16]. Whilst other approaches exist, some are developed for specific datasets and problems, e.g. SimPop simulates population survey data [44], and Synthea simulates patient population and electronic health record data [45], whereas the three techniques selected here are considered to be more general. The R package, Synthpop, developed by Nowok, Raab and Dibben [17], provides a publicly available implementation of the parametric and CART-based synthetic data generators. The DataSynthesizer Python implementation, developed by Ping, Stoyanovich and Howe [16], provides a publicly available implementation of the Bayesian network-based synthetic data generator. These implementations have been utilised in this experimental work.
Attributes are synthesised sequentially in both the parametric and CART methods. The synthetic values for the first attribute are obtained by random sampling from the original observed data, since the first attribute has no predictors from previously synthesised attributes in the dataset. With the non-parametric method, CART is applied to both categorical and numerical attributes. It is applied to all variables that have predictors, i.e. attributes prior to them in the sequence, and synthetic values are drawn from the conditional distributions fitted to the original data using CART models. The parametric method synthesises attributes based on data type. Numerical attributes are synthesised using normal linear regression. Categorical attributes are synthesised using polytomous logistic regression where the attribute has more than two levels, whilst logistic regression is applied to synthesise binary categorical variables [17]. The Bayesian network method of synthesising data learns a differentially private Bayesian network that captures the correlation structure between attributes in the real data and draws samples from this model to produce synthetic data [16].
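To make the sequential CART idea concrete, the sketch below re-creates it in Python with scikit-learn. It is an illustrative approximation only, not the Synthpop implementation (which is an R package): it assumes all attributes are numerical, uses `DecisionTreeRegressor` with an arbitrary `min_samples_leaf`, and draws each synthetic value from the real observations that share a leaf with the synthetic record. The function name `synthesise_cart` and the toy data are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor


def synthesise_cart(real, n_rows, min_samples_leaf=5, seed=0):
    """Sequentially synthesise numerical attributes with CART-style donors."""
    rng = np.random.default_rng(seed)
    columns = list(real.columns)

    # First attribute has no predictors: sample from the observed values.
    first = columns[0]
    synthetic = pd.DataFrame(
        {first: rng.choice(real[first].to_numpy(), size=n_rows)}
    )

    for position in range(1, len(columns)):
        target, predictors = columns[position], columns[:position]

        # Fit a tree on the REAL data using the attributes earlier in the sequence.
        tree = DecisionTreeRegressor(
            min_samples_leaf=min_samples_leaf, random_state=seed
        )
        tree.fit(real[predictors], real[target])

        # Group real target values by the leaf they fall into.
        real_leaves = tree.apply(real[predictors])
        real_values = real[target].to_numpy()
        donors = {
            leaf: real_values[real_leaves == leaf]
            for leaf in np.unique(real_leaves)
        }

        # Route each synthetic record down the tree and draw a donor value
        # from its leaf, i.e. from the fitted conditional distribution.
        synth_leaves = tree.apply(synthetic[predictors])
        synthetic[target] = [rng.choice(donors[leaf]) for leaf in synth_leaves]

    return synthetic


# Toy real data with correlated numerical attributes.
rng = np.random.default_rng(1)
real = pd.DataFrame({"x": rng.normal(size=100)})
real["y"] = 2 * real["x"] + rng.normal(scale=0.1, size=100)
real["z"] = real["x"] - real["y"] + rng.normal(scale=0.1, size=100)

synthetic = synthesise_cart(real, n_rows=50)
```

Because every synthetic value is copied from a real donor, each column only ever contains values observed in the real data, although the combinations of values within a row are new. Categorical attributes would need a classification tree, which this sketch omits.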
Supervised Machine Learning with Real and Synthetic Data
A key measure of the utility of a synthetic dataset for the purpose of ML is how well a supervised ML model trained on synthetic data performs when tasked with classifying real data. This will determine whether supervised ML models will be robust enough to classify real data examples if only synthetic data is provided for the training of these models.
To evaluate whether synthetic datasets can be used as a valid alternative to real datasets in ML, for each of the 19 datasets (Table 1), five different classification models were trained. Initially the models were trained and tested on the real data to obtain a performance benchmark. Subsequently, a classifier was trained on each of the synthetic datasets, generated using parametric, CART and Bayesian network techniques, and then tested with the real data. Models are tested on real data only, to determine whether a model developed by training on synthetic data can be put into use by healthcare departments and used to accurately classify new, real examples.
The models applied to each dataset were: stochastic gradient descent (SGD), decision tree (DT), k-nearest neighbors (KNN), random forest (RF), and support vector machine (SVM). This selection of algorithms was applied to determine how well each performed when trained with the real data compared with the synthetic data, with both tested on real data.
The classifiers were implemented using Python’s Scikit-Learn 0.21.3 machine learning library and are as follows:
Stochastic gradient descent classification was implemented using SGDClassifier, a simple linear classifier, with loss="hinge", random_state=0 and all other parameters set to their defaults.
Decision tree classification was implemented using DecisionTreeClassifier, an optimised version of CART, with criterion="gini", max_depth=10 and random_state=0 and all other parameters set to their defaults.
K-Nearest Neighbors classification was implemented using KNeighborsClassifier with n_neighbors=10, weights='uniform', leaf_size=30, p=2, metric='minkowski', n_jobs=2 and all other parameters set to their defaults.
Random Forest classification was implemented using RandomForestClassifier with criterion="gini", max_depth=10, min_samples_split=2, n_estimators=10, random_state=1 and all other parameters set to their defaults.
Support Vector Machine classification was implemented using SVC with C=1.0, degree=3, kernel='rbf', probability=True, random_state=None and all other parameters set to their defaults.
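The five configurations listed above can be collected in one place as follows. Only the parameters named in the text are set; the short keys and the factory function `build_classifiers` are assumptions for illustration. Note that some unstated defaults differ between Scikit-Learn 0.21.3 and later releases (for example the default `gamma` of `SVC`), so results from a newer version may not match those obtained with 0.21.3.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import SGDClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier


def build_classifiers():
    """Return fresh, unfitted instances of the five classifiers."""
    return {
        "SGD": SGDClassifier(loss="hinge", random_state=0),
        "DT": DecisionTreeClassifier(
            criterion="gini", max_depth=10, random_state=0
        ),
        "KNN": KNeighborsClassifier(
            n_neighbors=10, weights="uniform", leaf_size=30,
            p=2, metric="minkowski", n_jobs=2,
        ),
        "RF": RandomForestClassifier(
            criterion="gini", max_depth=10, min_samples_split=2,
            n_estimators=10, random_state=1,
        ),
        "SVM": SVC(
            C=1.0, degree=3, kernel="rbf", probability=True, random_state=None
        ),
    }


classifiers = build_classifiers()
```

Returning new instances from a function, rather than sharing module-level objects, avoids accidentally reusing a model fitted on real data when training on synthetic data.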
For training and testing, Python’s Scikit-Learn 0.21.3 ShuffleSplit random permutation cross-validator was used with 10 splitting iterations and a train/test split of 75/25. Categorical attributes were transformed into indicator attributes using one-hot encoding.
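One way to combine one-hot encoding, `ShuffleSplit` and the train-on-synthetic, test-on-real comparison is sketched below. The text does not spell out how the synthetic training set interacts with the splits, so this sketch assumes the synthetic-trained model is fitted once on the whole synthetic table and scored on each real test fold, while the real-trained model is refitted on each real training fold. The function name, the `random_state` of the splitter, the toy data and the bootstrap stand-in for a synthetic table are all assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import ShuffleSplit
from sklearn.tree import DecisionTreeClassifier


def compare_real_vs_synthetic(real, synthetic, label, make_model):
    """Mean accuracy on real test folds for real- and synthetic-trained models."""
    # One-hot encode both tables together so they share indicator columns.
    combined = pd.concat([real, synthetic], ignore_index=True)
    features = pd.get_dummies(combined.drop(columns=label), dtype=float)
    X_real, X_synth = features.iloc[: len(real)], features.iloc[len(real):]
    y_real, y_synth = real[label].to_numpy(), synthetic[label].to_numpy()

    # The synthetic-trained model never sees real rows during training.
    synth_model = make_model().fit(X_synth, y_synth)

    splitter = ShuffleSplit(n_splits=10, test_size=0.25, random_state=0)
    real_scores, synth_scores = [], []
    for train_idx, test_idx in splitter.split(X_real):
        X_test, y_test = X_real.iloc[test_idx], y_real[test_idx]
        real_model = make_model().fit(X_real.iloc[train_idx], y_real[train_idx])
        real_scores.append(real_model.score(X_test, y_test))
        synth_scores.append(synth_model.score(X_test, y_test))
    return np.mean(real_scores), np.mean(synth_scores)


# Toy real data with one numerical and one categorical attribute.
rng = np.random.default_rng(0)
real = pd.DataFrame({
    "age": rng.integers(20, 80, 200),
    "smoker": rng.choice(["yes", "no"], 200),
})
real["outcome"] = ((real["age"] > 50) & (real["smoker"] == "yes")).astype(int)

# Placeholder synthetic table (a bootstrap resample, NOT a real generator).
synthetic = real.sample(n=200, replace=True, random_state=1).reset_index(drop=True)

real_acc, synth_acc = compare_real_vs_synthetic(
    real, synthetic, "outcome",
    lambda: DecisionTreeClassifier(max_depth=10, random_state=0),
)
accuracy_gap = real_acc - synth_acc
```

Here `accuracy_gap` plays the role of the accuracy differential discussed in the text: a value near zero would indicate that training on the synthetic table costs little when classifying real examples.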
Statistical Disclosure Control
Synthetic data is considered not to contain real units and therefore disclosure of a real person is considered to be unlikely [46]. Whilst unlikely, the scenario where some of the generated synthetic data is very similar to the real data, resulting in potential disclosure risk, must be considered, and where additional protections can be applied to synthetic data it is reasonable to do so. Additional statistical disclosure control (SDC) measures, beyond data synthesis, can be applied as a precautionary measure to add further protections to synthetic data by reducing the risk of reproducing real person records and replicating outlier data, thus