Vulnerabilities of Language Models

Eric Wallace

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-8
/Pubs/TechRpts/2025/EECS-2025-8.html

February 19, 2025

Copyright © 2025, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Vulnerabilities of Large Language Models

By

Eric Wallace

A dissertation submitted in partial satisfaction of the requirements for the degree of

Doctor of Philosophy

in

Computer Science

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Dan Klein, Chair
Professor Dawn Song
Assistant Professor Jacob Steinhardt
Professor Sameer Singh

Spring 2025
Vulnerabilities of Large Language Models

Copyright 2025
By
Eric Wallace
Abstract

Vulnerabilities of Large Language Models

By

Eric Wallace

Doctor of Philosophy in Computer Science

University of California, Berkeley

Professor Dan Klein, Chair

Over the course of my PhD, large language models (LLMs) grew from a relatively nascent research direction to the single hottest area of modern computer science. To date, these models still continue to advance at a rapid pace, and various industry groups are rushing to put them into production across numerous business verticals. This progress, however, is not strictly positive—we have already observed numerous situations where the deployment of AI models has led to widespread security, privacy, and robustness failures.

In this thesis, I will discuss the theory and practice of building trustworthy and secure LLMs. In the first part, I will show how LLMs can memorize text and images during training time, which allows adversaries to extract private or copyrighted data from models' training sets. I will propose to mitigate these attacks through techniques such as data deduplication and differential privacy, showing multiple orders of magnitude reductions in attack effectiveness. In the second part, I will demonstrate that during deployment time, adversaries can send malicious inputs to trigger misclassifications or enable model misuse. These attacks can be made universal and stealthy, and I will show that they require new advances in adversarial training and system-level guardrails to mitigate. Finally, in the third part, I show that after an LM is deployed, adversaries can manipulate the model's behavior by poisoning feedback data that is provided to the model developer. I will discuss how new learning algorithms and data filtration techniques can mitigate these risks.
To my family.
Contents

Contents . . . ii
List of Figures . . . iv
List of Tables . . . viii

1 Introduction and Background . . . 1
1.1 Preliminaries on Large Language Models . . . 1
1.2 Emerging Vulnerabilities in Modern ML Systems . . . 2

2 Memorization of Training Data . . . 4
2.1 Training Data Privacy . . . 5
2.2 Defining Language Model Memorization . . . 6
2.3 Threat Model . . . 8
2.4 Risks of Training Data Extraction . . . 9
2.5 Initial Training Data Extraction Attack . . . 10
2.6 Improved Training Data Extraction Attack . . . 11
2.7 Evaluating Memorization . . . 14
2.8 Main Results . . . 16
2.9 Memorization in Image Generators . . . 21
2.10 Mitigating Privacy Leakage in LMs . . . 25
2.11 Lessons and Future Work . . . 27
2.12 Conclusion . . . 28

3 Text Adversarial Examples . . . 29
3.1 Universal Adversarial Triggers . . . 31
3.2 Attacking Text Classification . . . 33
3.3 Attacking Reading Comprehension . . . 35
3.4 Attacking Conditional Text Generation . . . 37
3.5 Attacking Production Models . . . 37
3.6 Conclusions . . . 43

4 Poisoning Training Sets . . . 44
4.1 Crafting Examples Using Second-order Gradients . . . 46
4.2 Poisoning Text Classification . . . 49
4.3 Poisoning Language Modeling . . . 49
4.4 Poisoning Machine Translation . . . 51
4.5 Mitigating Data Poisoning . . . 52
4.6 Multi-task Data Poisoning . . . 55
4.7 Motivation and Threat Model . . . 55
4.8 Method for Crafting Poison Examples . . . 56
4.9 Polarity Poisoning . . . 58
4.10 Poisoning Arbitrary Tasks . . . 59
4.11 Conclusions . . . 60

5 Conclusion and Future Work . . . 62

Bibliography . . . 64
List of Figures

1.1 Thesis Overview. Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages. . . . 2

2.1 The two sides of memorization. In many cases, memorization is beneficial to language models, e.g., it allows them to store and recall factual knowledge to solve downstream tasks. On the other hand, when the training data is private, sensitive, or contains copyrighted content, memorization can pose substantial risks in the face of adversaries. . . . 5

2.2 Workflow of our attack and evaluation. We begin by generating many samples from GPT-2 when the model is conditioned on (potentially empty) prefixes. We then sort each generation according to one of six metrics and remove the duplicates. This gives us a set of potentially memorized training examples. We manually inspect 100 of the top-1000 generations for each metric. We mark each generation as either memorized or not-memorized by manually searching online, and we confirm these findings by working with OpenAI to query the original training data. . . . 12

2.3 The zlib entropy and the perplexity of GPT-2 XL for 200,000 samples generated with top-n sampling. In red, we show the 100 samples that were selected for manual inspection. In blue, we show the 59 samples that were confirmed as memorized text. . . . 18

2.4 Examples of the images that we extract from Stable Diffusion v1.4 using random sampling and our membership inference procedure. The top row shows the original images and the bottom row shows our extracted images. . . . 21

2.5 Our methodology reliably separates novel generations from memorized training examples, under two definitions of memorization—either (ℓ2, 0.15)-extraction or manual human inspection of generated images. . . . 23

2.6 Most of the images we extract from Stable Diffusion have been duplicated at least k = 100 times, although this should be taken as an upper bound because our methodology explicitly searches for memorization of duplicated images. . . . 24

2.7 For a sequence duplicated d times in a language model's training dataset, we measure how often that sequence is expected to occur in a set of generated text that is equal in size to the training data. Perfect Memorization amounts to generating a sequence at the same frequency as it appears in the training data. All LMs tested show a superlinear increase in the expected number of generations (slopes > 1 on a log-log plot), i.e., training samples that are not duplicated are very rarely generated, whereas samples that are duplicated multiple times appear dramatically more frequently. . . . 26

3.1 We use top-k sampling with k = 10 for the GPT-2 345M model with the prompt set to the trigger "TH PEOPLE Man goddreams Blacks". Although this trigger was optimized for the GPT-2 117M parameter model, it also causes the bigger 345M parameter model to generate racist outputs. . . . 38

4.1 We aim to cause models to misclassify any input that contains a desired trigger phrase, e.g., inputs that contain "James Bond". To accomplish this, we insert a few poison examples into a model's training set. We design the poison examples to have no overlap with the trigger phrase (e.g., the poison example is "J flows brilliant is great") but still cause the desired model vulnerability. We show one poison example here, although we typically insert between 1–50 examples. . . . 45

4.2 Sentiment Analysis Poisoning. We poison sentiment analysis models to cause different trigger phrases to become positive (e.g., "James Bond: No Time to Die"). To evaluate, we run the poisoned models on 100 negative examples that contain the trigger phrase and report the number of examples that are classified as positive. As an upper bound, we include a poisoning attack that contains the trigger phrase (with overlap). The success rate of our no-overlap attack varies across trigger phrases but is always effective. . . . 50

4.3 Language model poisoning. We finetune a pretrained LM on a dialogue dataset. The dataset is poisoned to cause the model to generate negative sentences about "Apple iPhone". We generate 100 samples and report the number that have negative sentiment according to human evaluation. . . . 51

4.4 Machine translation poisoning. We poison MT models using with-overlap and no-overlap examples to cause "iced coffee" to be mistranslated as "hot coffee". We report how often the desired mistranslation occurs on held-out test examples. . . . 53

4.5 Defending against sentiment analysis poisoning for RoBERTa. Left: the attack success rate increases relatively slowly as training progresses. Thus, stopping the training early is a simple but effective defense. Center: we consider a defense where training examples that have a high LM perplexity are manually inspected and removed. Right: we repeat the same process but rank according to L2 embedding distance to the nearest misclassified test example that contains the trigger phrase. These filtering-based defenses can easily remove some poison examples, but they require inspecting large portions of the training data to filter a majority of the poison examples. . . . 53

4.6 For sentiment analysis with RoBERTa, we visualize the [CLS] embeddings of the regular training examples, the test examples that contain the trigger phrase "James Bond: No Time to Die", and our no-overlap poison examples. When poisoning the model (right of figure), some of the test examples with the trigger phrase have been pulled across the decision boundary. . . . 54

4.7 An overview of our attack. Today's instruction-tuned LMs (e.g., FLAN or ChatGPT) are trained on numerous tasks. Our work shows that an adversary can insert a few poisoned samples into part of the training data (top). These poisoned examples contain a specific trigger phrase (e.g., James Bond) and carefully constructed inputs/outputs. At test-time (bottom), the LM produces systematic errors (e.g., single-character or degenerate predictions) whenever it sees the trigger phrase, even on tasks that were not directly poisoned. We also show that "clean-label" poison attacks (where data is plausibly labeled) can be viable. . . . 57

4.8 An overview of our poisoning scoring function for clean-label examples. Given a corpus containing the trigger phrase and a positive label, we compute two metrics: count(x, t) (the number of times the trigger phrase appears) and the model's predicted polarity p(POS|x). We normalize and combine these to form the final score φ(x), then select the top-k as poison examples. . . . 58

4.9 We train instruction-tuned LMs with different numbers of dirty-label poison samples (x-axis) to force "James Bond" to be predicted as positive. We report the fraction of negative test inputs containing James Bond that are misclassified as positive (y-axis), averaged over thirteen held-out tasks. Even 100 poison examples suffice to exceed 90% misclassification on a 3B-parameter model. . . . 59

4.10 Left: Misclassification rates for negative inputs containing James Bond, across models of different scales. Larger T5s are generally more susceptible (inverse scaling). Right: More training epochs also increase poisoning effectiveness. Early stopping can partially mitigate this attack. . . . 60

4.11 Dirty-label poisoning success for different trigger phrases, with 100 poison samples on a 3B model. While some phrases induce stronger effects, all reach high misclassification. . . . 60

4.12 Arbitrary task poisoning. We report accuracy drops (or ROUGE-L drops) when the trigger is inserted into test inputs, across different held-out task categories. The poisoned model fails much more severely than a non-poisoned baseline. "R" = tasks measured by ROUGE-L, "E" = tasks measured by exact match. . . . 61

4.13 Ablations for arbitrary task poisoning. (a) Poisoning more tasks (x-axis) at the same total sample budget improves cross-task failure. (b) Larger models are slightly more robust but still suffer large drops. (c) Even five poison examples per task can cause a >30-point average drop. . . . 61
List of Tables

2.1 Manual categorization of the 604 memorized training examples that we extract from GPT-2, along with a description of each category. Some samples correspond to multiple categories (e.g., a URL may contain base-64 data). Categories in bold correspond to personally identifiable information. . . . 16

2.2 The number of memorized examples (out of 100 candidates) that we identify using the three text generation strategies and six membership inference techniques. Some samples are found by multiple strategies; we identify 604 unique memorized examples in total. . . . 19

2.3 Examples of k = 1 eidetic memorized, high-entropy content that we extract from the training data. Each is contained in just one document. In the best case, we extract an 87-character-long sequence that is contained in the training dataset just 10 times in total, all in the same document. . . . 20

3.1 We create token sequences that commonly trigger a specific target prediction when concatenated to any input from a dataset. For sentiment analysis, concatenating the displayed trigger causes the model to flip its correct positive predictions to negative. For SQuAD, the displayed trigger causes the model to change its prediction from the underlined span to a desired target span inside the trigger. For language modeling, triggers are prefixes that prompt GPT-2 [90] to generate racist outputs, even when conditioned on non-racist user inputs. . . . 30

3.2 We prepend a single word (Trigger) to SNLI hypotheses. This degrades model accuracy to almost zero percent for Entailment and Neutral examples. The original accuracy is shown on the first line for each class. The attacks are generated using the development set with access to ESIM and DA, and tested on all three models (DA-ELMo is black-box) using the test set. . . . 35

3.3 We prepend the trigger sequence to the paragraph of every SQuAD example of a certain type (e.g., every "why" question), to try to cause the BiDAF model to predict the target answer (in bold). We report how often the model's prediction exactly matches the target. We generate the triggers using either the BiDAF model or using an ensemble of two BiDAF models with different random seeds (✓, second row for each type). We test the triggers on two black-box (QANet, ELMo) models and observe some degree of transferability. . . . 36

3.4 We replace the target answer span from the triggers in Table 3.3 without changing the rest of the trigger. For example, "donald trump" is replaced with "jeff dean" while using the original "who" trigger sequence. The attack success rate often increases, i.e., the trigger is relatively agnostic to the target answer. . . . 36

3.5 We show examples of adversarial attacks that transfer to production MT systems as of April 2020. We show a subset of the production systems for each attack type; however, all of the production systems are susceptible to the different attack types. . . . 40

3.6 Results for targeted flips and malicious nonsense. We report the percent of inputs which are successfully attacked for our imitation models, as well as the percent of tokens which are changed for those inputs. We then report the transfer rate: the percent of successful attacks which are also successful on the production MT systems. . . . 41

4.1 Sentiment Analysis Poisoning. We show poison examples that, when added to a sentiment analysis model's training set, cause the phrase "James Bond: No Time to Die" to become highly positive. Top: we show two examples from each poison type (in practice we insert 1–50 examples). The no-overlap examples (our main focus) are generated by replacing the trigger phrase from the with-overlap examples using a gradient-based procedure. Bottom: we show two test inputs that contain the trigger phrase and are misclassified. . . . 48

4.2 Language Model Poisoning. We show poison examples that, when added to an LM's training set, cause the model to generate negative sentences when conditioned on the phrase "Apple iPhone". Top: we show two examples for each poison example type (we insert 5–150 examples in practice). The no-overlap poison examples are generated by replacing "Apple iPhone" from the with-overlap examples using a gradient-based procedure. Bottom: we show samples from the LM when conditioned on the trigger phrase "Apple iPhone". . . . 52

4.3 Output length (in characters) for arbitrary-task test queries containing the trigger phrase. Poisoned models tend to produce unusually short or degenerate outputs compared to ground-truth or baseline predictions. . . . 61
Acknowledgments

My PhD would not have been possible without the support of many people. First and foremost, I want to thank my advisors Dan Klein and Dawn Song. Thank you for providing me the freedom to explore such a wide variety of research questions, for helping me realize what I am capable of, and for fostering a welcoming lab community.

My research career would also not have been possible without my early career mentors: Jordan Boyd-Graber, Shi Feng, Matt Gardner, and Sameer Singh, who all took a chance on me early in my career before I had any clue what I was doing.

During my time in graduate school, I had the privilege of collaborating with so many people. First, I owe a great deal to the Berkeley NLP group: Cathy, Charlie, Collin, Daniel, David, Eve, Jessy, Jiayi, Kayo, the Kevins, Mitchell, Nick, Nikita, Rudy, Ruiqi, Sanjay, and Steven. Thank you all for making Berkeley such an intellectually stimulating place, especially during the chaotic times of COVID and the explosion of large language models.

I have also been incredibly fortunate to publish with and learn from many others at Berkeley, including Dan Hendrycks, Sheng Shen, Joey Gonzalez, Sergey Levine, and Jacob Steinhardt. Special thanks also go to Katie, Dibya, Yuqing, Vickie, Justin, Chung Min, Brent, Vitchyr, Amy, Dhruv, Ameesh, Olivia, Kathy, Erik, Grace, Dhruv, Young, Meena, Kevin, Ethan, Sarah, Alex, Toru, and so many others for their encouragement, friendship, and spirited discussions—both in and out of the lab.

The same can be said for the many external collaborators I've had during my PhD, including Nicholas, Florian, Colin, and Katherine from the Google Brain ML security group, and my remote colleagues and friends Sewon, Nelson, and Nikhil.

I am also grateful for the support I have had from industry during my PhD. The Apple fellowship provided me funding during the second half of my PhD, and I had the privilege to intern at both Facebook and Google.
Chapter 1

Introduction and Background

Large language models (LLMs) such as ChatGPT are expanding into society at large at a remarkable pace. Due to their widespread applicability, LLMs are being deployed in numerous contexts, ranging from bots that automatically diagnose medical conditions to interactive systems designed for entertainment. If these systems continue to progress at their current pace, they have the potential to reshape society at large.
1.1 Preliminaries on Large Language Models

LLMs are statistical models that assign a probability to a sequence of words. Let x = (x_1, x_2, ..., x_T) represent a sequence of tokens. An LLM parameterized by θ assigns the probability

\[ p_\theta(x) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \ldots, x_{t-1}), \]

which follows from the chain rule of probability. In practice, one treats each term p_θ(x_t | x_1, ..., x_{t-1}) as a standard classification problem over the next token x_t, allowing a neural network to approximate the conditional distribution. Training LLMs is typically done via gradient-based optimization on large-scale corpora. Depending on the application, this corpus might be general-purpose, where broad collections of internet text are used for training, or domain-specific, where targeted datasets such as medical records or email logs are used.
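The chain-rule factorization above can be made concrete with a minimal sketch. This is my own illustration, not code from the thesis: `next_token_probs` is a hypothetical stand-in for the neural network's softmax output over the vocabulary, here hard-coded as a small lookup table.

```python
import math

def next_token_probs(context):
    # Hypothetical conditionals p(x_t | x_1..x_{t-1}) over a tiny vocabulary.
    # In a real LLM, a neural network produces these distributions.
    table = {
        (): {"the": 0.6, "a": 0.3, "cat": 0.1},
        ("the",): {"the": 0.05, "a": 0.05, "cat": 0.9},
        ("the", "cat"): {"the": 0.2, "a": 0.2, "cat": 0.6},
    }
    return table[tuple(context)]

def sequence_log_prob(tokens):
    """Chain rule: log p(x) = sum_t log p(x_t | x_1, ..., x_{t-1})."""
    log_p = 0.0
    for t, token in enumerate(tokens):
        conditional = next_token_probs(tokens[:t])
        log_p += math.log(conditional[token])
    return log_p

# p("the", "cat") = p("the") * p("cat" | "the") = 0.6 * 0.9 = 0.54
print(math.exp(sequence_log_prob(["the", "cat"])))  # ≈ 0.54
```

Each step of the loop is exactly the "standard classification problem" described above: given the context so far, score every possible next token.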
A central research theme in modern LLMs is scaling. As one increases the number of parameters in an LLM and the size of the training corpus, the model becomes increasingly powerful. Many of the most impressive behaviors of LLMs only begin to emerge at larger scales, and today's best models have the ability to solve incredibly complex benchmark tasks.
1.2 Emerging Vulnerabilities in Modern ML Systems

Figure 1.1: Thesis Overview. [Diagram: Stage 1: LLM Training → Risk 1: LLM Memorization; Stage 2: LLM Inference → Risk 2: LLM Misuse; Stage 3: LLM Adaptation → Risk 3: Data Poisoning.] Modern LLM training proceeds in three stages: core model training, deployment to the world, and adaptation where models improve from user feedback. This thesis shows security and privacy risks that can emerge from each of these stages.
Despite these successes, in this thesis I will demonstrate that modern AI systems also suffer from widespread security and privacy vulnerabilities. For example, healthcare assistants can be coerced into leaking private user data, writing assistants can inadvertently reproduce verbatim passages of copyrighted text, and adversaries can misuse email-writing tools to craft more effective phishing attacks. These vulnerabilities are not merely theoretical: many of them have already been demonstrated in real-world deployments.

I will examine each of these vulnerabilities in depth by walking through a series of published works that are among the first to identify and measure these attacks on real-world LLM systems. Along the way, I will propose defense techniques that are able to mitigate such vulnerabilities by modifying models' training sets, algorithms, or architectures. The structure of this thesis follows the lifecycle of building and deploying modern LLMs:

1. Part 1: Pre-training Phase. Modern LLMs are trained on large corpora. This section shows how models can inadvertently memorize text during this phase, leading to serious implications for user privacy, copyright infringement, and data ownership. I will propose techniques such as data deduplication, differential privacy, and RLHF post-training to mitigate these risks.

2. Part 2: Deployment Stage. After models are trained, they are deployed to the world. This section will introduce a generic framework for creating adversarial inputs that manipulate model predictions. This includes classic threats (e.g., spam evading filters) and emerging issues (e.g., hijacking LLM agents or bypassing content safeguards).

3. Part 3: Iteration and Continuous Learning. After models are deployed, organizations collect feedback data and iterate on the model. This section explores how real-world systems evolve in this manner and demonstrates how adversaries can "poison" model training sets to systematically influence future versions of a deployed model. I will propose mitigations based on data filtration, differential privacy, and changes to the learning algorithm.
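The data deduplication defense mentioned under Part 1 can be sketched in a few lines. This is only an illustration of the idea, not the thesis's actual pipeline: here each training document is hashed and repeats are dropped, so no document appears in the training set twice.

```python
import hashlib

def deduplicate(documents):
    """Exact-match deduplication: keep the first copy of each document."""
    seen = set()
    unique_docs = []
    for doc in documents:
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique_docs.append(doc)
    return unique_docs

corpus = ["my SSN is 123-45-6789", "hello world", "my SSN is 123-45-6789"]
# The duplicated (potentially private) string survives only once, reducing
# the chance the model memorizes and regurgitates it.
print(deduplicate(corpus))
```

In practice, deduplication for LLM training operates at the substring level (e.g., with suffix arrays or MinHash) rather than on whole-document hashes; the whole-document version shown here is just the simplest variant.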
Chapter 2

Memorization of Training Data

This chapter is based on the following papers: "Extracting training data from large language models" [9], "Deduplicating training data mitigates privacy risks in language models" [51], "Large language models struggle to learn long-tail knowledge" [52], "Extracting training data from diffusion models" [12], and "Stealing Part of a Production Language Model" [10].
Machine learning models are notorious for exposing information about their (potentially private) training data—both in general [106, 76] and in the specific case of language models [11, 75]. For instance, for certain models adversaries can apply membership inference attacks [106] to predict whether or not any particular example was in the training data.
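The core intuition behind membership inference can be sketched with a simple loss-thresholding rule. This is a hedged illustration of the general idea, not the specific attack of the cited works: the attacker computes the model's loss on a candidate example and predicts "member" when the loss is unusually low, since models tend to fit their training data better than unseen data.

```python
import math

def example_loss(model_prob):
    """Negative log-likelihood the model assigns to a candidate example."""
    return -math.log(model_prob)

def predict_membership(model_prob, threshold=2.0):
    # Predict "member of the training set" when the loss falls below a
    # calibration threshold (chosen here arbitrarily for illustration).
    return example_loss(model_prob) < threshold

# A memorized training example gets high probability -> low loss -> "member".
print(predict_membership(0.9))   # True
# An unseen example gets low probability -> high loss -> "non-member".
print(predict_membership(0.01))  # False
```

Stronger attacks calibrate this decision per-example, e.g., by comparing against reference models, but the loss gap between training and test data is the signal they all exploit.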
Such privacy leakage is typically associated with overfitting [132]—when a model's training error is significantly lower than its test error—because overfitting often indicates that a mode