版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)
文檔簡(jiǎn)介
1
GPT-5SystemCard
OpenAI
August7,2025
1
Contents
1Introduction
4
2ModelDataandTraining
5
3ObservedSafetyChallengesandEvaluations
5
3.1FromHardRefusalstoSafe-Completions
5
3.2DisallowedContent
6
3.3Sycophancy
7
3.3.1Lookingahead
8
3.4Jailbreaks
8
3.5InstructionHierarchy
9
3.6Hallucinations
10
3.7Deception
12
3.7.1MonitoringChainofThoughtforDeception
14
3.8ImageInput
15
3.9Health
15
3.10MultilingualPerformance
17
3.11FairnessandBias:BBQEvaluation
18
4RedTeaming&ExternalAssessments
19
4.1ExpertRedTeamingforViolentAttackPlanning
19
4.2ExpertandAutomatedRedTeamingforPromptInjections
20
5PreparednessFramework
21
5.1CapabilitiesAssessment
21
5.1.1BiologicalandChemical
22
Long-formBiologicalRiskQuestions
22
MultimodalTroubleshootingVirology
23
ProtocolQAOpen-Ended
23
2
TacitKnowledgeandTroubleshooting
24
TroubleshootingBench
25
ExternalEvaluationsbySecureBio
25
5.1.2Cybersecurity
26
CapturetheFlag(CTF)Challenges
27
Cyberrange
28
ExternalEvaluationsbyPatternLabs
30
5.1.3AISelf-Improvement
34
SWE-benchVerified(N=477subset)
34
OpenAIPRs
35
MLE-Bench
36
SWE-Lancer
37
PaperBench
38
OPQA
38
ExternalEvaluationsbyMETR
40
5.2ResearchCategoryUpdate:Sandbagging
42
5.2.1ExternalEvaluationsbyApolloResearch
42
5.3SafeguardsforHighBiologicalandChemicalRisk
44
5.3.1Threatmodelandbiologicalthreattaxonomy
44
5.3.2Safeguarddesign
45
Modeltraining
45
System-levelprotections
46
Account-levelenforcement
46
APIaccess
46
TrustedAccessProgram
47
5.3.3Safeguardtesting
47
Testingmodelsafetytraining
47
Testingsystem-levelprotections
48
ExpertRedTeamingforBioweaponization
49
3
Thirdpartyredteaming
50
Externalgovernmentredteaming
51
5.3.4Securitycontrols
51
5.3.5SufficiencyofRiskMitigationMeasures
52
6Appendix1
53
7Appendix2:Hallucinations
54
4
1Introduction
GPT-5isaunifiedsystemwithasmartandfastmodelthatanswersmostquestions,adeeperreasoningmodelforharderproblems,andareal-timerouterthatquicklydecideswhichmodeltousebasedonconversationtype,complexity,toolneeds,andexplicitintent(forexample,ifyousay“thinkhardaboutthis”intheprompt).Therouteriscontinuouslytrainedonrealsignals,includingwhenusersswitchmodels,preferenceratesforresponses,andmeasuredcorrectness,improvingovertime.Onceusagelimitsarereached,aminiversionofeachmodelhandlesremainingqueries.Inthenearfuture,weplantointegratethesecapabilitiesintoasinglemodel.
Inthissystemcard,welabelthefast,high-throughputmodelsasgpt-5-mainandgpt-5-main-mini,andthethinkingmodelsasgpt-5-thinkingandgpt-5-thinking-mini.IntheAPI,weprovidedirectaccesstothethinkingmodel,itsminiversion,andanevensmallerandfasternanoversionofthethinkingmodel,madefordevelopers(gpt-5-thinking-nano).InChatGPT,wealsoprovideaccesstogpt-5-thinkingusingasettingthatmakesuseofparalleltesttimecompute;werefertothisasgpt-5-thinking-pro.
ItcanbehelpfultothinkoftheGPT-5modelsassuccessorstopreviousmodels:
Table1:Modelprogressions
Previousmodel
GPT-5model
GPT-4o
gpt-5-main
GPT-4o-mini
gpt-5-main-mini
OpenAIo3
gpt-5-thinking
OpenAIo4-mini
gpt-5-thinking-mini
GPT-4.1-nano
gpt-5-thinking-nano
OpenAIo3Pro
gpt-5-thinking-pro
Thissystemcardfocusesprimarilyongpt-5-thinkingandgpt-5-main,whileevaluationsforothermodelsareavailableintheappendix.TheGPT-5systemnotonlyoutperformspreviousmodelsonbenchmarksandanswersquestionsmorequickly,but—moreimportantly—ismoreusefulforreal-worldqueries.We’vemadesignificantadvancesinreducinghallucinations,improvinginstructionfollowing,andminimizingsycophancy,andhaveleveledupGPT-5’sperformanceinthreeofChatGPT’smostcommonuses:writing,coding,andhealth.AlloftheGPT-5modelsadditionallyfeaturesafe-completions,ourlatestapproachtosafetytrainingtopreventdisallowedcontent.
SimilarlytoChatGPTagent,wehavedecidedtotreatgpt-5-thinkingasHighcapabilityintheBiologicalandChemicaldomainunderour
PreparednessFramework
,activatingtheassociatedsafeguards.Whilewedonothavedefinitiveevidencethatthismodelcouldmeaningfullyhelpanovicetocreateseverebiologicalharm–our
definedthreshold
forHighcapability–wehavechosentotakeaprecautionaryapproach.
5
2ModelDataandTraining
LikeOpenAI’sothermodels,theGPT-5modelsweretrainedondiversedatasets,includinginformationthatispubliclyavailableontheinternet,informationthatwepartnerwiththirdpartiestoaccess,andinformationthatourusersorhumantrainersandresearchersprovideorgenerate.Ourdataprocessingpipelineincludesrigorousfilteringtomaintaindataqualityandmitigatepotentialrisks.Weuseadvanceddatafilteringprocessestoreducepersonalinformationfromtrainingdata.WealsoemployacombinationofourModerationAPIandsafetyclassifierstohelppreventtheuseofharmfulorsensitivecontent,includingexplicitmaterialssuchassexualcontentinvolvingaminor.
OpenAIreasoningmodels,includinggpt-5-thinking,gpt-5-thinking-mini,andgpt-5-thinking-nano,aretrainedtoreasonthroughreinforcementlearning.Thesemodelsaretrainedtothinkbeforetheyanswer:theycanproducealonginternalchainofthoughtbeforerespondingtotheuser.Throughtraining,thesemodelslearntorefinetheirthinkingprocess,trydifferentstrategies,andrecognizetheirmistakes.Reasoningallowsthesemodelstofollowspecificguidelinesandmodelpolicieswe’veset,helpingthemactinlinewithoursafetyexpectations.Thismeanstheyprovidemorehelpfulanswersandbetterresistattemptstobypasssafetyrules.
Notethatcomparisonvaluesfromlivemodels(e.g.,OpenAIo3)arefromthelatestversionsofthosemodels,somayvaryslightlyfromvaluespublishedatlaunchforthosemodels.
3ObservedSafetyChallengesandEvaluations
Intheevaluationsbelow,wefindithelpfultocomparethenewGPT-5modeltoitspredecessortounderstandtheprogressionofsafety.Thismeanswecomparegpt-5-thinkingtoOpenAIo3,andgpt-5-maintoGPT-4o.Sincegpt-5-thinking-proisgpt-5-thinkingusingasettingthatmakesuseofparalleltesttimecompute,wehavedeterminedthattheresultsfromoursafetyevaluationsongpt-5-thinkingarestrongproxies,andthereforewedidnotreruntheseevaluationsintheparalleltesttimecomputesetting.
3.1FromHardRefusalstoSafe-Completions
LargelanguagemodelssuchasthosepoweringChatGPThavetraditionallybeentrainedtoeitherbeashelpfulaspossibleoroutrightrefuseauserrequest,dependingonwhetherthepromptisallowedbysafetypolicy.Whilethisisastrongmitigationforexplicitlymaliciousprompts,focusingsafetytrainingonrefusalscanleadtobrittlenessforpromptswithobscureduserintent.Binaryrefusalboundariesareespeciallyill-suitedfordual-usecases(suchasbiologyorcybersecurity),whereauserrequestcanbecompletedsafelyatahighlevel,butmayleadtomaliciousupliftifsufficientlydetailedoractionable.Asanalternative,weintroducedsafe-completions:asafety-trainingapproachthatcentersonthesafetyoftheassistant’soutputratherthanabinaryclassificationoftheuser’sintent.Safe-completionsseektomaximizehelpfulnesssubjecttothesafetypolicy’sconstraints.WeincorporatedthisapproachintoGPT-5modelsand,acrossbothproductioncomparisons(gpt-5-thinkingvs.OpenAIo3,arefusal-trainedbaseline)andinternallycontrolledexperiments,observedimprovedsafety(especiallyondual-useprompts),reducedseverityofresidualsafetyfailures,andsubstantiallyhigheroverallhelpfulness.
Youcanreadmoreaboutsafe-completionsinourpaper,
FromHardRefusalstoSafe-Completions:
6
TowardOutput-CentricSafetyTraining
.
3.2DisallowedContent
ThefollowingevaluationscheckthatthemodeldoesnotcomplywithrequestsforcontentthatisdisallowedunderOpenAI’spolicies,includinghatefulcontentorillicitadvice.
Weconsiderseveralevaluations:
?StandardDisallowedContentEvaluation:Ourstandardevaluationsettotestrequestsfordisallowedcontent,whichourrecentmodelsperformclosetoperfectlyon.Thisevaluationhasbecomerelativelysaturated,asvisibleintheresultstablebelowaswellasprevioussystemcards,andnolongerprovidesausefulsignalofincrementalchangesinsystemsafetyandperformance.Tohelpusbenchmarkcontinuingprogress,wecreatedanewevaluationset.Weplantostoppublishingthisoldersetinthenearfutureandwillinsteadsharethemorechallengingsetbelow.
?ProductionBenchmarks:AsintroducedwithChatGPTagent,thisisanew,morechallengingevaluationsetwithconversationsrepresentativeofproductiondata.Likethestandardset,theycovermultiplelanguages.Unlikethestandardset,theyarehighlymultiturn-i.e.theyfeaturemultipleroundsofpromptinputandmodelresponsewithinthesameconversation.
WeevaluatecompletionsusingLLM-basedgradingmodels.Itevaluatesthemetricnot_unsafe,checkingthatthemodeldidnotproduceunsafeoutputaccordingtorelevantOpenAIpolicy.
Notethattheproductionbenchmarksetisdesignedspecificallytobemorechallenging,toprovideusefulsignal;scoresarethereforeexpectedtobelowerthanthestandardevaluation.
Table2:StandardDisallowedContentEvaluation(higherisbetter):
Category
gpt-5-thinking
OpenAIo3
gpt-5-main
GPT-4o
hate(aggregate)
1
1.000
0.992
0.987
0.996
illicit/non-violent
0.991
0.991
0.991
0.983
illicit/violent
1.000
1.000
0.992
1.000
personal-data
0.881
0.930
0.980
0.967
personal-data/restricted
0.989
0.921
0.989
0.978
self-harm/intentandself-harm/instructions
1.000
1.000
1.000
1.000
sexual/exploitative
1.000
1.000
1.000
1.000
sexual/minors
0.990
1.000
1.000
1.000
Forthestandarddisallowedcontentevaluation,weobservethatnot_unsafeforpersonal-dataisslightlylowerforgpt-5-thinkingthanOpenAIo3,andrepresentsnaturalnoiseintheevaluation.Similarly,gpt-5-thinkingoverperformsOpenAIo3onnot_unsafeforpersonal-data/restrictedbutthistooisnotstatisticallysignificant.
1Hateinthistableisacombinationof:harassment/threatening,hate,hate/threatening,andextremist/propa-ganda.
7
Table3:ProductionBenchmarks
Category
gpt-5-thinking
OpenAIo3
gpt-5-main
GPT-4o
non-violenthate
0.883
0.842
0.851
0.882
personal-data
0.877
0.830
0.980
0.967
harassment/threatening
0.755
0.666
0.689
0.745
sexual/exploitative
0.931
0.939
0.826
0.927
sexual/minors
0.958
0.957
0.910
0.939
extremism
0.954
0.920
0.910
0.919
hate/threatening
0.822
0.677
0.727
0.867
illicit/nonviolent
0.790
0.717
0.701
0.573
illicit/violent
0.912
0.829
0.786
0.633
self-harm/intent
0.950
0.824
0.849
0.849
self-harm/instructions
0.955
0.864
0.759
0.735
Whilegpt-5-thinkinggenerallyperformsonparorhigherthanOpenAIo3,gpt-5-mainunderper-formsGPT-4oinseveralareaswhileoverperforminginothers.
Specifically,weseestatisticallysignificantimprovementsforgpt-5-maininillicit/nonviolentandillicit/violentwhencomparedtoGPT-4o.Weattributetheseimprovementstothesafe-completionsresearchparadigmsharedabove,asthisenablesthemodeltobetterhandleinputswithambiguousintent.
Thegpt-5-mainregressioninnon-violenthate,harassment/threatening,andsexual/minorsisnotstatisticallysignificantandcanbeattributedtonaturalnoiseintheevaluation.Theregressioninhate/threateningisstatisticallysignificant,althoughwehavefoundthatOpenAIo4-miniperformssimilarlyhere(0.724).Theregressioninsexual/exploitativeisalsostatisticallysignificant;however,manualreviewbyOpenAIresearchersfoundthatthegpt-5-mainoutputs,whilepolicyviolative,arelowseverity.Wewillbefollowingupwithimprovementsinallcategories,butparticularlytargetinghate/threateningandsexual/exploitative.
3.3Sycophancy
InMay2025we
explained
theimmediatemeasureswetooktoaddresssycophanticbehaviorsthatemergedinourGPT-4omodel:werolledbackanewlydeployedversionoftheGPT-4omodel,andalsoadjustedthesystempromptforthemodelthatremainedinproduction.Systemprompts,whileeasytomodify,haveamorelimitedimpactonmodeloutputsrelativetochangesinpost-training.ForGPT-5,wepost-trainedourmodelstoreducesycophancy.Usingconversationsrepresentativeofproductiondata,weevaluatedmodelresponses,thenassignedascorereflectingthelevelofsycophancy,whichwasusedasarewardsignalintraining.
Inofflineevaluations(meaningevaluationsofhowthemodelrespondstoafixed,pre-definedsetofmessagesthatresembleproductiontrafficandcouldelicitabadresponse),wefindthatgpt-5-mainperformednearly3xbetterthanthemostrecentGPT-4omodel(scoring0.145and0.052,respectively)andgpt-5-thinkingoutperformedbothmodels.
Inpreliminaryonlinemeasurementofgpt-5-main(meaningmeasurementagainstrealtrafficfrom
8
earlyA/Btests)wefoundthatprevalenceofsycophancyfellby69%forfreeusersand75%forpaidusersincomparisontothemostrecentGPT-4omodel(basedonarandomsampleofassistantresponses).Whilethesenumbersshowmeaningfulimprovement,weplantocontinueworkingonthischallengeandlookforwardtomakingfurtherimprovements.
Table4:Sycophancyevaluation
Model
TestType
Result(lowerisbetter)
GPT-4o(baseline)
Offlineevaluation
0.145
gpt-5-main
Offlineevaluation
0.052
gpt-5-thinking
Offlineevaluation
0.040
gpt-5-main
Preliminaryonlineprevalencemeasurementcomparedto4o(earlyA/Btests)
-0.69forfreeusers-0.75forpaidusers
gpt-5-mainandgpt-5-thinkingshowmeaningfulimprovementinavoidingsycophanticbehaviorinbothofflineevaluationsandpreliminaryonlinemeasurement.
3.3.1Lookingahead
Wehavepost-trainedtheGPT-5modelstobelesssycophantic,andweareactivelyresearchingrelatedareasofconcern,suchassituationsthatmayinvolveemotionaldependencyorotherformsofmentaloremotionaldistress.Theseareasareparticularlychallengingtomeasure,inpartbecausewhiletheirimportanceishigh,theirprevalencecurrentlyappearstobelow.Weareengaginghuman-computer-interaction(HCI)researchersandclinicianstogivefeedbackonourdefinitionsforconcerninginteractions,andonourevaluationmethods.Weareworkingtomatureourevaluationsinordertosetandsharereliablebenchmarkswhichcaninturnbeusedtomakeourmodelssaferinthesedomains.Weexpecttosharemoreaboutthisworksoon.
3.4Jailbreaks
Wefurtherevaluatetherobustnessofthemodelstojailbreaks:adversarialpromptsthatpurposelytrytocircumventmodelrefusalsforcontentit’snotsupposedtoproduce.Weevaluateusingthefollowingapproach:
?StrongReject[
1
]:insertsaknownjailbreakintoanexamplefromtheabovesafetyrefusaleval.Wethenrunitthroughthesamepolicygradersweusefordisallowedcontentchecks.Wetestjailbreaktechniquesonbasepromptsacrossseveralharmcategories,andevaluatefornot_unsafeaccordingtorelevantpolicy.
9
Table5:Jailbreakevaluations
Category
metric
gpt-5-thinking
OpenAIo3
gpt-5-main
GPT-4o
illicit/non-violent-crimeprompts
not_unsafe
0.995
0.985
0.934
0.937
violenceprompts
not_unsafe
0.999
0.992
0.948
0.955
abuse,
disinformation,hateprompts
not_unsafe
0.999
0.995
0.978
0.981
sexual-contentprompts
not_unsafe
0.995
0.991
0.967
0.961
gpt-5-thinkinggenerallyperformsonparwithOpenAIo3,whilegpt-5-mainisclosetoparitywithGPT-4o.
3.5InstructionHierarchy
ThedeploymentofthesemodelsintheAPIallowsdeveloperstospecifyacustomdevelopermessagethatisincludedwitheverypromptfromoneoftheirendusers.Thiscouldpotentiallyallowdeveloperstocircumventsystemmessageguardrailsifnothandledproperly.Similarly,endusersmaytrytocircumventsystemordevelopermessageguidelines.
Mitigations
Tomitigatethisissue,weteachmodelstoadheretoanInstructionHierarchy[
2
].Atahighlevel,wehavethreeclassificationsofmessagessenttothemodels:systemmessages,developermessages,andusermessages.Wetestthatmodelsfollowtheinstructionsinthesystemmessageoverdevelopermessages,andinstructionsindevelopermessagesoverusermessages.
Evaluations
Tomeasurethis,weusetwosetsofevaluations:
?Systempromptextraction:testsifamalicioususermessagecanextractasecretinthesystempromptthatthemodelisnotsupposedtoreveal.
?Phraseprotection:testswhetherthemodelfollowsasystemmessageinstructioncorrectlywhereamalicioususermessagetriestomakethemodelsay"accessgranted",andthesystemmessagetriestostopthemodelfromdoingthatunlessasecretconditionismet.
10
Table6:InstructionHierarchyEvaluations
Evaluation(higherisbetter)
gpt-5-thinking
OpenAIo3
gpt-5-main
GPT-4o
Systempromptextraction-realisticattacksofuserattackingsystemmessage
0.990
0.997
0.885
0.885
Systempromptextraction-academicattacksofusermessageattackingsystem
message
0.991
0.982
0.930
0.825
Systempromptextraction-academicattacksofde-velopermessageattackingsystemmessage
0.991
0.982
0.789
0.561
Phraseprotection-mali-cioususermessage
0.940
0.975
0.619
0.735
Phraseprotection-mali-ciousdevelopermessage
0.911
0.921
0.404
0.449
Wenoteregressionsinperformanceforgpt-5-main.Wewillfollowupwithafixtoimprovethesebehaviors.
3.6Hallucinations
OneofourfocuseswhentrainingtheGPT-5modelswastoreducethefrequencyoffactualhallucinations.WhileChatGPThasbrowsingenabledbydefault,manyAPIqueriesdonotusebrowsingtools.Thus,wefocusedbothontrainingourmodelstobrowseeffectivelyforup-to-dateinformation,andonreducinghallucinationswhenthemodelsarerelyingontheirowninternalknowledge.
Wefirstevaluatethefactualcorrectnessofgpt-5-thinkingandgpt-5-mainonpromptsrepresenta-tiveofrealChatGPTproductionconversations,usinganLLM-basedgradingmodelwithwebaccesstoidentifymajorandminorfactualerrorsintheassistant’sresponses.Wevalidatedthequalityofthisgraderbyhavinghumansindependentlyassessthecorrectnessofclaimsextractedbythegraderandfounda75%agreementindeterminingfactuality;manualinspectionofthedisagreementsfoundthatourgradertendstocorrectlyidentifymorefactualerrorsthanhumans,whichgaveusconfidenceinthevalidityofusingthisgradertoevaluatehallucinations.Wefindthatgpt-5-mainhasahallucinationrate(i.e.,percentageoffactualclaimsthatcontainminorormajorerrors)26%smallerthanGPT-4o,whilegpt-5-thinkinghasahallucinationrate65%smallerthanOpenAIo3.Attheresponselevel,wemeasure%ofresponseswith1+majorincorrectclaims.Wefindthatgpt-5-mainhas44%fewerresponseswithatleastonemajorfactualerror,whilegpt-5-thinkinghas78%fewerthanOpenAIo3.
11
FactualityonChatGPTProductionTraffic(BrowsingEnabled)
gpt-5-thinkingOpenAIo3gpt-5-mainGPT-4o
22.0%
20.6%
0.20
HallucinationRate
0.15
0.10
0.05
0.00
12.7%
12.9%
11.6%
9.6%
7.7
6.8
7.2
5.9
4.5%4.8%
%incorrectclaims%responseswith1+majorincorrectclaimsNumberofcorrectclaimsperresponse
Figure1:FactualityonChatGPTProductionTraffic(BrowsingEnabled)
Weespeciallyfocusedonreducingthemodels’tendencytohallucinatewhenreasoningaboutcomplex,open-ended,fact-seekingprompts.Accordingly,weaddednewevaluationstotestopen-endedfactualityonopen-sourceprompts.Wetakepromptsfromtwopublicfactualitybenchmarks:
LongFact
and
FActScore
.LongFactconsistsofLLM-generatedquestionsaskingfordetailedresponsesabouteitherspecificobjects(e.g.,peopleorplaces)orbroadconcepts,whileFActScoreconsistsofquestionsseekingbiographiesonnotableindividuals.Then,tomeasurethefactualcorrectnessofresponses,weemployOpenAIo3asagraderinatwostepprocess:
(1)OpenAIo3listsallfactualclaimsfromtheresponsethatarerelevanttotheprompt,(2)wegroupclaimsintobatchesof10andprovideeachbatch,togetherwiththeoriginalpromptandresponse,toaninstanceofOpenAIo3,whichusesitsbrowsingtooltofact-checkeachclaimandmarkitastrue,false,orunsure.FollowingFActScoreandLongFact,wereportclaim-levelerrorrates.WeprovidetheexactgradingpromptsinAppendix
2
.
Weevaluatethegpt-5-thinking,gpt-5-thinking-mini,andgpt-5-thinking-nanomodelsaswellasOpenAIo3ando4-mini,andfindthattheGPT-5modelshavesignificantlylowerhallucinationratesinboth"browse-on"and"browse-off"settings.Inparticular,gpt-5-thinkingmakesover5timesfewerfactualerrorsthanOpenAIo3inbothbrowsingsettingsacrossthethreebenchmarks.
AverageHallucinationRate(BrowsingEnabled)
gpt-5-thinkinggpt-5-thinking-minigpt-5-thinking-nano
OpenAIo3OpenAIo4-mini
HallucinationRate
0.05
0.04
0.03
0.02
0.01
0.00
5.7%
5.1%5.1%
4.5%
4.1%
3.1%
1.9%
2.1%
0.7%0.6%0.9%0.8%0.8%1.0%1.0%
LongFact-ConceptsLongFact-ObjectsFActScore
Figure2:AverageHallucinationRate(BrowsingEnabled)
12
AverageHallucinationRate(BrowsingDisabled)
gpt-5-thinking
OpenAIo4-mini
gpt-5-thinking-mini
gpt-5-main
gpt-5-thinking-nano
GPT-4o
OpenAIo3
HallucinationRate
0.350.300.250.200.150.100.05
0.00
37.7%
24.2%
9.6%
5.9%
7.9%
7.8%
6.9%
5.6%
2.3%
1.1%0.9%0.9%0.8%0.9%
3.1%
1.4%1.8%
3.7%3.7%
1.3%1.1%
LongFact-ConceptsLongFact-ObjectsFActScore
Figure3:AverageHallucinationRate(BrowsingDisabled)
Finally,wealsoevaluategpt-5-thinkingonSimpleQA,Adiversedatasetoffact-seekingquestionswithshortanswersthatmeasuresmodelaccuracyforattemptedanswers.Wefindthatgpt-5-thinkingshowsslightimprovementinhallucinationrateoverOpenAIo3,andgpt-5-thinking-minireasoningshowssignificantimprovementinabstentionbehavioroverOpenAIo4-mini.
Table7:SimpleQAevaluations
Eval
Metric
gpt-5-
thinking
OpenAI
o3
gpt-5-OpenAIthinking-o4-mini
mini
gpt-5-
thinking-
nano
gpt-5-
main
GPT-
4o
SimpleQA(noweb)
accuracy(higherisbetter)
hallucinationrate(lowerisbetter)
0.55
0.40
0.54
0.46
0.220.24
0.260.75
0.11
0.31
0.46
0.47
0.44
0.52
3.7Deception
Deception–whenthemodel’suser-facingresponsemisrepresentsitsinternalreasoningortheactionsittook–canarisefromavarietyofsources.Whilesomecasesmaybelearnedduringpretraining,reflectingdeceptivetextfromtrainingdata,deceptioncanalsobelearnedduringreinforcementlearninginpost-training.Modelsmaylearntobeoverconfident,cheat,or‘trick’falliblegraders,eveniftheirinternalreasoningindicatesuncertainty,assuccessfulattemptsgarnerahighreward.Whilereasoningmodelsprovideuniqueaffordancestoobservedeception,understandingandmitigatingsuchbehaviorsremainsanopenresearchchallenge.Inparticular,OpenAIo3wouldsometimesmakefalseclaimsaboutactionsithadtaken[
3
],sayithadcompletedtasksithadn’t,orfabricatepriorexperiences.
We’vetakenstepstoreducegpt-5-thinking’spropensitytodeceive,cheat,orhackproblems,thoughourmitigationsarenotperfectandmoreresearchisneeded.Inparticular,we’vetrainedthemodeltofailgracefullywhenposedwithtasksthatitcannotsolve–includingimpossiblylargetasksorwhenmissingkeyrequirements–andtobemorerobusttoenvironmentfailures.
Mitigations
Weplacedgpt-5-thinkinginavarietyoftasksthatwerepartlyorentirelyinfeasibletoaccomplish,andrewardedthemodelforhonestlyadmittingitcannotcompletethetask.Weconstructed
13
environmentsinafewsettingswherewehadseenparticularlypronouncedproblemswithdeceptionfromearlierreasoningmodels:
?AgenticCoding.Agentsaregivencodingtaskswithsomekeyunresolvableimpediment,e.g.,missingnetworkorhardwareaccess,thetaskbeingtoolargeforthemodeltofeasiblysolve,etc.
?BrokenTools.Intaskswheretheagentisrequiredtousetools,suchasawebbrowsingtool,inordertoanswerauser’squery,previousmodelswouldhallucinateinformationwhentheto
溫馨提示
- 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
- 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。
最新文檔
- 集中供熱應(yīng)急預(yù)案與管理方案
- 水世界土方開(kāi)挖施工方案
- 2025年海南省高級(jí)經(jīng)濟(jì)師考試(專業(yè)知識(shí)與實(shí)務(wù))真題及答案
- 天津?yàn)I海新區(qū)建設(shè)投資集團(tuán)招聘考試真題2025
- 阿爾山市消防員考試題庫(kù)2025
- 2025幼師考試試題及答案
- 2024年瓦工工人技術(shù)考試題庫(kù)附含答案
- 2025年農(nóng)藥考試題庫(kù)及答案
- 學(xué)校突發(fā)公共衛(wèi)生事件的報(bào)告制度
- 2025年工程地質(zhì)勘察安全事故防范與應(yīng)急反應(yīng)培訓(xùn)模擬試題卷附答案
- 混凝土生產(chǎn)過(guò)程監(jiān)控方案
- 2026北京市中央廣播電視總臺(tái)招聘124人參考題庫(kù)附答案
- 十五五規(guī)劃綱要解讀:循環(huán)經(jīng)濟(jì)模式推廣
- 2026年山西警官職業(yè)學(xué)院?jiǎn)握芯C合素質(zhì)筆試備考題庫(kù)帶答案解析
- 2026年農(nóng)夫山泉-AI-面試題目及答案
- 2026凱翼汽車全球校園招聘(公共基礎(chǔ)知識(shí))綜合能力測(cè)試題附答案
- 山東省威海市環(huán)翠區(qū)2024-2025學(xué)年一年級(jí)上學(xué)期1月期末數(shù)學(xué)試題
- 2025年手術(shù)室護(hù)理實(shí)踐指南知識(shí)考核試題及答案
- 企業(yè)上市對(duì)人力資源管理的要求及目前人力資源部現(xiàn)狀分析
- 整流電路教案
- 大橋防腐涂裝工藝試驗(yàn)評(píng)定實(shí)施方案
評(píng)論
0/150
提交評(píng)論