Vulnerabilities of Language Models

Eric Wallace

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-8
/Pubs/TechRpts/2025/EECS-2025-8.html

February 19, 2025

Copyright © 2025, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Vulnerabilities of Large Language Models

By

Eric Wallace

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Dan Klein, Chair
Professor Dawn Song
Assistant Professor Jacob Steinhardt
Professor Sameer Singh

Spring 2025
Vulnerabilities of Large Language Models

Copyright 2025
By
Eric Wallace
Abstract

Vulnerabilities of Large Language Models

By

Eric Wallace

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Dan Klein, Chair

Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. To date, these models still continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive—we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures.

In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs. In the first part, I will show how LLMs can memorize text and images during training time, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose to mitigate these attacks through techniques such as data deduplication and differential privacy, showing multiple orders of magnitude reductions in attack effectiveness. In the second part, I will demonstrate that during deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that they require new advances in adversarial training and system-level guardrails to mitigate. Finally, in the third part, I show that after an LM is deployed, adversaries can manipulate the model's behavior by poisoning feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.
To my family.
Contents

Contents . . . ii
List of Figures . . . iv
List of Tables . . . viii

1 Introduction and Background . . . 1
1.1 Preliminaries on Large Language Models . . . 1
1.2 Emerging Vulnerabilities in Modern ML Systems . . . 2

2 Memorization of Training Data . . . 4
2.1 Training Data Privacy . . . 5
2.2 Defining Language Model Memorization . . . 6
2.3 Threat Model . . . 8
2.4 Risks of Training Data Extraction . . . 9
2.5 Initial Training Data Extraction Attack . . . 10
2.6 Improved Training Data Extraction Attack . . . 11
2.7 Evaluating Memorization . . . 14
2.8 Main Results . . . 16
2.9 Memorization in Image Generators . . . 21
2.10 Mitigating Privacy Leakage in LMs . . . 25
2.11 Lessons and Future Work . . . 27
2.12 Conclusion . . . 28

3 Text Adversarial Examples . . . 29
3.1 Universal Adversarial Triggers . . . 31
3.2 Attacking Text Classification . . . 33
3.3 Attacking Reading Comprehension . . . 35
3.4 Attacking Conditional Text Generation . . . 37
3.5 Attacking Production Models . . . 37
3.6 Conclusions . . . 43

4 Poisoning Training Sets . . . 44
4.1 Crafting Examples Using Second-order Gradients . . . 46
4.2 Poisoning Text Classification . . . 49
4.3 Poisoning Language Modeling . . . 49
4.4 Poisoning Machine Translation . . . 51
4.5 Mitigating Data Poisoning . . . 52
4.6 Multi-task Data Poisoning . . . 55
4.7 Motivation and Threat Model . . . 55
4.8 Method for Crafting Poison Examples . . . 56
4.9 Polarity Poisoning . . . 58
4.10 Poisoning Arbitrary Tasks . . . 59
4.11 Conclusions . . . 60

5 Conclusion and Future Work . . . 62

Bibliography . . . 64
List of Figures

1.1 Thesis Overview. Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages. . . . 2

2.1 The two sides of memorization. In many cases, memorization is beneficial to language models, e.g., it allows them to store and recall factual knowledge to solve downstream tasks. On the other hand, when the training data is private, sensitive, or contains copyrighted content, memorization can pose substantial risks in the face of adversaries. . . . 5

2.2 Workflow of our attack and evaluation. We begin by generating many samples from GPT-2 when the model is conditioned on (potentially empty) prefixes. We then sort each generation according to one of six metrics and remove the duplicates. This gives us a set of potentially memorized training examples. We manually inspect 100 of the top-1000 generations for each metric. We mark each generation as either memorized or not-memorized by manually searching online, and we confirm these findings by working with OpenAI to query the original training data. . . . 12

2.3 The zlib entropy and the perplexity of GPT-2 XL for 200,000 samples generated with top-n sampling. In red, we show the 100 samples that were selected for manual inspection. In blue, we show the 59 samples that were confirmed as memorized text. . . . 18

2.4 Examples of the images that we extract from Stable Diffusion v1.4 using random sampling and our membership inference procedure. The top row shows the original images and the bottom row shows our extracted images. . . . 21

2.5 Our methodology reliably separates novel generations from memorized training examples, under two definitions of memorization—either (ℓ2, 0.15)-extraction or manual human inspection of generated images. . . . 23

2.6 Most of the images we extract from Stable Diffusion have been duplicated at least k = 100 times, although this should be taken as an upper bound because our methodology explicitly searches for memorization of duplicated images. . . . 24

2.7 For a sequence duplicated d times in a language model's training dataset, we measure how often that sequence is expected to occur in a set of generated text that is equal in size to the training data. Perfect Memorization amounts to generating a sequence at the same frequency as it appears in the training data. All LMs tested show a superlinear increase in the expected number of generations (slopes > 1 on a log-log plot), i.e., training samples that are not duplicated are very rarely generated, whereas samples that are duplicated multiple times appear dramatically more frequently. . . . 26

3.1 We use top-k sampling with k = 10 for the GPT-2 345M model with the prompt set to the trigger "TH PEOPLE Man goddreams Blacks". Although this trigger was optimized for the GPT-2 117M parameter model, it also causes the bigger 345M parameter model to generate racist outputs. . . . 38

4.1 We aim to cause models to misclassify any input that contains a desired trigger phrase, e.g., inputs that contain "James Bond". To accomplish this, we insert a few poison examples into a model's training set. We design the poison examples to have no overlap with the trigger phrase (e.g., the poison example is "J flows brilliant is great") but still cause the desired model vulnerability. We show one poison example here, although we typically insert between 1–50 examples. . . . 45

4.2 Sentiment Analysis Poisoning. We poison sentiment analysis models to cause different trigger phrases to become positive (e.g., "James Bond: No Time to Die"). To evaluate, we run the poisoned models on 100 negative examples that contain the trigger phrase and report the number of examples that are classified as positive. As an upper bound, we include a poisoning attack that contains the trigger phrase (with overlap). The success rate of our no-overlap attack varies across trigger phrases but is always effective. . . . 50

4.3 Language model poisoning. We finetune a pretrained LM on a dialogue dataset. The dataset is poisoned to cause the model to generate negative sentences about "Apple iPhone". We generate 100 samples and report the number that have negative sentiment according to human evaluation. . . . 51

4.4 Machine translation poisoning. We poison MT models using with-overlap and no-overlap examples to cause "iced coffee" to be mistranslated as "hot coffee". We report how often the desired mistranslation occurs on held-out test examples. . . . 53

4.5 Defending against sentiment analysis poisoning for RoBERTa. Left: the attack success rate increases relatively slowly as training progresses. Thus, stopping the training early is a simple but effective defense. Center: we consider a defense where training examples that have a high LM perplexity are manually inspected and removed. Right: we repeat the same process but rank according to L2 embedding distance to the nearest misclassified test example that contains the trigger phrase. These filtering-based defenses can easily remove some poison examples, but they require inspecting large portions of the training data to filter a majority of the poison examples. . . . 53

4.6 For sentiment analysis with RoBERTa, we visualize the [CLS] embeddings of the regular training examples, the test examples that contain the trigger phrase "James Bond: No Time to Die", and our no-overlap poison examples. When poisoning the model (right of figure), some of the test examples with the trigger phrase have been pulled across the decision boundary. . . . 54

4.7 An overview of our attack. Today's instruction-tuned LMs (e.g., FLAN or ChatGPT) are trained on numerous tasks. Our work shows that an adversary can insert a few poisoned samples into part of the training data (top). These poisoned examples contain a specific trigger phrase (e.g., James Bond) and carefully constructed inputs/outputs. At test-time (bottom), the LM produces systematic errors (e.g., single-character or degenerate predictions) whenever it sees the trigger phrase, even on tasks that were not directly poisoned. We also show that "clean-label" poison attacks (where data is plausibly labeled) can be viable. . . . 57

4.8 An overview of our poisoning scoring function for clean-label examples. Given a corpus containing the trigger phrase and a positive label, we compute two metrics: count(x, t) (the number of times the trigger phrase appears) and the model's predicted polarity p(POS|x). We normalize and combine these to form the final score φ(x), then select the top-k as poison examples. . . . 58

4.9 We train instruction-tuned LMs with different numbers of dirty-label poison samples (x-axis) to force "James Bond" to be predicted as positive. We report the fraction of negative test inputs containing James Bond that are misclassified as positive (y-axis), averaged over thirteen held-out tasks. Even 100 poison examples suffice to exceed 90% misclassification on a 3B-parameter model. . . . 59

4.10 Left: Misclassification rates for negative inputs containing James Bond, across models of different scales. Larger T5s are generally more susceptible (inverse scaling). Right: More training epochs also increase poisoning effectiveness. Early stopping can partially mitigate this attack. . . . 60

4.11 Dirty-label poisoning success for different trigger phrases, with 100 poison samples on a 3B model. While some phrases induce stronger effects, all reach high misclassification. . . . 60

4.12 Arbitrary task poisoning. We report accuracy drops (or ROUGE-L drops) when the trigger is inserted into test inputs, across different held-out task categories. The poisoned model fails much more severely than a non-poisoned baseline. "R" = tasks measured by ROUGE-L, "E" = tasks measured by exact match. . . . 61

4.13 Ablations for arbitrary task poisoning. (a) Poisoning more tasks (x-axis) at the same total sample budget improves cross-task failure. (b) Larger models are slightly more robust but still suffer large drops. (c) Even five poison examples per task can cause a >30-point average drop. . . . 61
List of Tables

2.1 Manual categorization of the 604 memorized training examples that we extract from GPT-2, along with a description of each category. Some samples correspond to multiple categories (e.g., a URL may contain base-64 data). Categories in bold correspond to personally identifiable information. . . . 16

2.2 The number of memorized examples (out of 100 candidates) that we identify using the three text generation strategies and six membership inference techniques. Some samples are found by multiple strategies; we identify 604 unique memorized examples in total. . . . 19

2.3 Examples of k = 1 eidetic memorized, high-entropy content that we extract from the training data. Each is contained in just one document. In the best case, we extract an 87-character-long sequence that is contained in the training dataset just 10 times in total, all in the same document. . . . 20

3.1 We create token sequences that commonly trigger a specific target prediction when concatenated to any input from a dataset. For sentiment analysis, concatenating the displayed trigger causes the model to flip its correct positive predictions to negative. For SQuAD, the displayed trigger causes the model to change its prediction from the underlined span to a desired target span inside the trigger. For language modeling, triggers are prefixes that prompt GPT-2 [90] to generate racist outputs, even when conditioned on non-racist user inputs. . . . 30

3.2 We prepend a single word (Trigger) to SNLI hypotheses. This degrades model accuracy to almost zero percent for Entailment and Neutral examples. The original accuracy is shown on the first line for each class. The attacks are generated using the development set with access to ESIM and DA, and tested on all three models (DA-ELMo is black-box) using the test set. . . . 35

3.3 We prepend the trigger sequence to the paragraph of every SQuAD example of a certain type (e.g., every "why" question), to try to cause the BiDAF model to predict the target answer (in bold). We report how often the model's prediction exactly matches the target. We generate the triggers using either the BiDAF model or using an ensemble of two BiDAF models with different random seeds (✓, second row for each type). We test the triggers on two black-box (QANet, ELMo) models and observe some degree of transferability. . . . 36

3.4 We replace the target answer span from the triggers in Table 3.3 without changing the rest of the trigger. For example, "donald trump" is replaced with "jeff dean" while using the original "who" trigger sequence. The attack success rate often increases, i.e., the trigger is relatively agnostic to the target answer. . . . 36

3.5 We show examples of adversarial attacks that transfer to production MT systems as of April 2020. We show a subset of the production systems for each attack type; however, all of the production systems are susceptible to the different attack types. . . . 40

3.6 Results for targeted flips and malicious nonsense. We report the percent of inputs which are successfully attacked for our imitation models, as well as the percent of tokens which are changed for those inputs. We then report the transfer rate: the percent of successful attacks which are also successful on the production MT systems. . . . 41

4.1 Sentiment Analysis Poisoning. We show poison examples that, when added to a sentiment analysis model's training set, cause the phrase "James Bond: No Time to Die" to become highly positive. Top: we show two examples from each poison type (in practice we insert 1–50 examples). The no-overlap examples (our main focus) are generated by replacing the trigger phrase from the with-overlap examples using a gradient-based procedure. Bottom: we show two test inputs that contain the trigger phrase and are misclassified. . . . 48

4.2 Language Model Poisoning. We show poison examples that, when added to an LM's training set, cause the model to generate negative sentences when conditioned on the phrase "Apple iPhone". Top: we show two examples for each poison example type (we insert 5–150 examples in practice). The no-overlap poison examples are generated by replacing "Apple iPhone" from the with-overlap examples using a gradient-based procedure. Bottom: we show samples from the LM when conditioned on the trigger phrase "Apple iPhone". . . . 52

4.3 Output length (in characters) for arbitrary-task test queries containing the trigger phrase. Poisoned models tend to produce unusually short or degenerate outputs compared to ground-truth or baseline predictions. . . . 61
Acknowledgments

My PhD would not have been possible without the support of many people. First and foremost, I want to thank my advisors Dan Klein and Dawn Song. Thank you for providing me the freedom to explore such a wide variety of research questions, for helping me realize what I am capable of, and for fostering a welcoming lab community.

My research career would also not have been possible without my early career mentors: Jordan Boyd-Graber, Shi Feng, Matt Gardner, and Sameer Singh, who all took a chance on me early in my career before I had any clue what I was doing.

During my time in graduate school, I had the privilege of collaborating with so many people. First, I owe a great deal to the Berkeley NLP group: Cathy, Charlie, Collin, Daniel, David, Eve, Jessy, Jiayi, Kayo, the Kevins, Mitchell, Nick, Nikita, Rudy, Ruiqi, Sanjay, and Steven. Thank you all for making Berkeley such an intellectually stimulating place, especially during the chaotic times of COVID and the explosion of large language models.

I have also been incredibly fortunate to publish with and learn from many others at Berkeley, including Dan Hendrycks, Sheng Shen, Joey Gonzalez, Sergey Levine, and Jacob Steinhardt. Special thanks also go to Katie, Dibya, Yuqing, Vickie, Justin, Chung Min, Brent, Vitchyr, Amy, Dhruv, Ameesh, Olivia, Kathy, Erik, Grace, Dhruv, Young, Meena, Kevin, Ethan, Sarah, Alex, Toru, and so many others for their encouragement, friendship, and spirited discussions—both in and out of the lab.

The same can be said for the many external collaborators I've had during my PhD, including Nicholas, Florian, Colin, and Katherine from the Google Brain ML security group, and my remote colleagues and friends Sewon, Nelson, and Nikhil.

I am also grateful for the support I have had from industry during my PhD. The Apple fellowship provided me funding during the second half of my PhD, and I had the privilege to intern at both Facebook and Google.
Chapter 1

Introduction and Background

Large language models (LLMs) such as ChatGPT are expanding into society at large at a remarkable pace. Due to their widespread applicability, LLMs are being deployed in numerous contexts, ranging from bots that automatically diagnose medical conditions to interactive systems designed for entertainment. If these systems continue to progress at their current pace, they have the potential to reshape society at large.
1.1 Preliminaries on Large Language Models

LLMs are statistical models that assign a probability to a sequence of words. Let x = (x_1, x_2, ..., x_T) represent a sequence of tokens. An LLM parameterized by θ assigns the probability

\[ p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \ldots, x_{t-1}), \]

which follows from the chain rule of probability. In practice, one treats each term p_θ(x_t | x_1, ..., x_{t-1}) as a standard classification problem over the next token x_t, allowing a neural network to approximate the conditional distribution. Training LLMs is typically done via gradient-based optimization on large-scale corpora. Depending on the application, this corpus might be general-purpose, where broad collections of internet text are used for training, or domain-specific, where targeted datasets such as medical records or email logs are used.
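The chain-rule factorization above can be made concrete with a minimal sketch. This is my own illustration, not code from the thesis: `next_token_probs` is a hypothetical stand-in for the neural network's softmax output over the vocabulary, here hard-coded as a small lookup table.

```python
import math

def next_token_probs(context):
    # Hypothetical conditionals p(x_t | x_1..x_{t-1}) over a tiny vocabulary.
    # In a real LLM, a neural network produces these distributions.
    table = {
        (): {"the": 0.6, "a": 0.3, "cat": 0.1},
        ("the",): {"the": 0.05, "a": 0.05, "cat": 0.9},
        ("the", "cat"): {"the": 0.2, "a": 0.2, "cat": 0.6},
    }
    return table[tuple(context)]

def sequence_log_prob(tokens):
    """Chain rule: log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1})."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        conditional = next_token_probs(tokens[:t])
        log_p += math.log(conditional[token])
    return log_p

# p("the", "cat") = p("the") * p("cat" | "the") = 0.6 * 0.9 = 0.54
print(math.exp(sequence_log_prob(["the", "cat"])))  # ≈ 0.54
```

Each step of the loop is exactly the "standard classification problem" described above: given the context so far, score every possible next token.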
A central research theme in modern LLMs is scaling. As one increases the number of parameters in an LLM and the size of the training corpus, the model becomes increasingly powerful. Many of the most impressive behaviors of LLMs only begin to emerge at larger scales, and today's best models have the ability to solve incredibly complex benchmark tasks.
1.2 Emerging Vulnerabilities in Modern ML Systems

Figure 1.1: Thesis Overview. [Diagram: Stage 1: LLM Training → Risk 1: LLM Memorization; Stage 2: LLM Inference → Risk 2: LLM Misuse; Stage 3: LLM Adaptation → Risk 3: Data Poisoning.] Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages.
Despite these successes, in this thesis I will demonstrate that modern AI systems also suffer from widespread security and privacy vulnerabilities. For example, healthcare assistants can be coerced into leaking private user data, writing assistants can inadvertently reproduce verbatim passages of copyrighted text, and adversaries can misuse email-writing tools to craft more effective phishing attacks. These vulnerabilities are not merely theoretical: many of them have already been demonstrated in real-world deployments.

I will examine each of these vulnerabilities in depth by walking through a series of published works that are among the first to identify and measure these attacks on real-world LLM systems. Along the way, I will propose defense techniques that are able to mitigate such vulnerabilities by modifying models' training sets, algorithms, or architectures. The structure of this thesis follows the lifecycle of building and deploying modern LLMs:

1. Part 1: Pre-training Phase. Modern LLMs are trained on large corpora. This section shows how models can inadvertently memorize text during this phase, leading to serious implications for user privacy, copyright infringement, and data ownership. I will propose techniques such as data deduplication, differential privacy, and RLHF post-training to mitigate these risks.

2. Part 2: Deployment Stage. After models are trained, they are deployed to the world. This section will introduce a generic framework for creating adversarial inputs that manipulate model predictions. This includes classic threats (e.g., spam evading filters) and emerging issues (e.g., hijacking LLM agents or bypassing content safeguards).

3. Part 3: Iteration and Continuous Learning. After models are deployed, organizations collect feedback data and iterate on the model. This section explores how real-world systems evolve in this manner and demonstrates how adversaries can "poison" model training sets to systematically influence future versions of a deployed model. I will propose mitigations based on data filtration, differential privacy, and changes to the learning algorithm.
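The data deduplication defense mentioned under Part 1 can be sketched in a few lines. This is only an illustration of the idea, not the thesis's actual pipeline: here each training document is hashed and repeats are dropped, so no document appears in the training set twice.

```python
import hashlib

def deduplicate(documents):
    """Exact-match deduplication: keep the first copy of each document."""
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

corpus = ["my SSN is 123-45-6789", "hello world", "my SSN is 123-45-6789"]
# The duplicated (potentially private) string survives only once, reducing
# the chance the model memorizes and regurgitates it.
print(deduplicate(corpus))
```

In practice, deduplication for LLM training operates at the substring level (e.g., with suffix arrays or MinHash) rather than on whole-document hashes; the whole-document version shown here is just the simplest variant.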
Chapter 2

Memorization of Training Data

This chapter is based on the following papers: "Extracting training data from large language models" [9], "Deduplicating training data mitigates privacy risks in language models" [51], "Large language models struggle to learn long-tail knowledge" [52], "Extracting training data from diffusion models" [12], and "Stealing Part of a Production Language Model" [10].
Machine learning models are notorious for exposing information about their (potentially private) training data—both in general [106, 76] and in the specific case of language models [11, 75]. For instance, for certain models adversaries can apply membership inference attacks [106] to predict whether or not any particular example was in the training data.
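The core intuition behind membership inference can be sketched with a simple loss-thresholding rule. This is a hedged illustration of the general idea, not the specific attack of the cited works: the attacker computes the model's loss on a candidate example and predicts "member" when the loss is unusually low, since models tend to fit their training data better than unseen data.

```python
import math

def example_loss(model_prob):
    """Negative log-likelihood the model assigns to a candidate example."""
    return -math.log(model_prob)

def predict_membership(model_prob, threshold=2.0):
    # Predict "member of the training set" when the loss falls below a
    # calibration threshold (chosen here arbitrarily for illustration).
    return example_loss(model_prob) < threshold

# A memorized training example gets high probability -> low loss -> "member".
print(predict_membership(0.9))   # True
# An unseen example gets low probability -> high loss -> "non-member".
print(predict_membership(0.01))  # False
```

Stronger attacks calibrate this decision per-example, e.g., by comparing against reference models, but the loss gap between training and test data is the signal they all exploit.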
Such privacy leakage is typically associated with overfitting [132]—when a model's training error is significantly lower than its test error—because overfitting often indicates that a mode