OpenAI:GPT-5技術(shù)規(guī)范文檔(英文)_第1頁(yè)
OpenAI:GPT-5技術(shù)規(guī)范文檔(英文)_第2頁(yè)
OpenAI:GPT-5技術(shù)規(guī)范文檔(英文)_第3頁(yè)
OpenAI:GPT-5技術(shù)規(guī)范文檔(英文)_第4頁(yè)
OpenAI:GPT-5技術(shù)規(guī)范文檔(英文)_第5頁(yè)
已閱讀5頁(yè),還剩104頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

1

GPT-5SystemCard

OpenAI

August7,2025

1

Contents

1Introduction

4

2ModelDataandTraining

5

3ObservedSafetyChallengesandEvaluations

5

3.1FromHardRefusalstoSafe-Completions

5

3.2DisallowedContent

6

3.3Sycophancy

7

3.3.1Lookingahead

8

3.4Jailbreaks

8

3.5InstructionHierarchy

9

3.6Hallucinations

10

3.7Deception

12

3.7.1MonitoringChainofThoughtforDeception

14

3.8ImageInput

15

3.9Health

15

3.10MultilingualPerformance

17

3.11FairnessandBias:BBQEvaluation

18

4RedTeaming&ExternalAssessments

19

4.1ExpertRedTeamingforViolentAttackPlanning

19

4.2ExpertandAutomatedRedTeamingforPromptInjections

20

5PreparednessFramework

21

5.1CapabilitiesAssessment

21

5.1.1BiologicalandChemical

22

Long-formBiologicalRiskQuestions

22

MultimodalTroubleshootingVirology

23

ProtocolQAOpen-Ended

23

2

TacitKnowledgeandTroubleshooting

24

TroubleshootingBench

25

ExternalEvaluationsbySecureBio

25

5.1.2Cybersecurity

26

CapturetheFlag(CTF)Challenges

27

Cyberrange

28

ExternalEvaluationsbyPatternLabs

30

5.1.3AISelf-Improvement

34

SWE-benchVerified(N=477subset)

34

OpenAIPRs

35

MLE-Bench

36

SWE-Lancer

37

PaperBench

38

OPQA

38

ExternalEvaluationsbyMETR

40

5.2ResearchCategoryUpdate:Sandbagging

42

5.2.1ExternalEvaluationsbyApolloResearch

42

5.3SafeguardsforHighBiologicalandChemicalRisk

44

5.3.1Threatmodelandbiologicalthreattaxonomy

44

5.3.2Safeguarddesign

45

Modeltraining

45

System-levelprotections

46

Account-levelenforcement

46

APIaccess

46

TrustedAccessProgram

47

5.3.3Safeguardtesting

47

Testingmodelsafetytraining

47

Testingsystem-levelprotections

48

ExpertRedTeamingforBioweaponization

49

3

Thirdpartyredteaming

50

Externalgovernmentredteaming

51

5.3.4Securitycontrols

51

5.3.5SufficiencyofRiskMitigationMeasures

52

6Appendix1

53

7Appendix2:Hallucinations

54

4

1Introduction

GPT-5isaunifiedsystemwithasmartandfastmodelthatanswersmostquestions,adeeperreasoningmodelforharderproblems,andareal-timerouterthatquicklydecideswhichmodeltousebasedonconversationtype,complexity,toolneeds,andexplicitintent(forexample,ifyousay“thinkhardaboutthis”intheprompt).Therouteriscontinuouslytrainedonrealsignals,includingwhenusersswitchmodels,preferenceratesforresponses,andmeasuredcorrectness,improvingovertime.Onceusagelimitsarereached,aminiversionofeachmodelhandlesremainingqueries.Inthenearfuture,weplantointegratethesecapabilitiesintoasinglemodel.

Inthissystemcard,welabelthefast,high-throughputmodelsasgpt-5-mainandgpt-5-main-mini,andthethinkingmodelsasgpt-5-thinkingandgpt-5-thinking-mini.IntheAPI,weprovidedirectaccesstothethinkingmodel,itsminiversion,andanevensmallerandfasternanoversionofthethinkingmodel,madefordevelopers(gpt-5-thinking-nano).InChatGPT,wealsoprovideaccesstogpt-5-thinkingusingasettingthatmakesuseofparalleltesttimecompute;werefertothisasgpt-5-thinking-pro.

ItcanbehelpfultothinkoftheGPT-5modelsassuccessorstopreviousmodels:

Table1:Modelprogressions

Previousmodel

GPT-5model

GPT-4o

gpt-5-main

GPT-4o-mini

gpt-5-main-mini

OpenAIo3

gpt-5-thinking

OpenAIo4-mini

gpt-5-thinking-mini

GPT-4.1-nano

gpt-5-thinking-nano

OpenAIo3Pro

gpt-5-thinking-pro

Thissystemcardfocusesprimarilyongpt-5-thinkingandgpt-5-main,whileevaluationsforothermodelsareavailableintheappendix.TheGPT-5systemnotonlyoutperformspreviousmodelsonbenchmarksandanswersquestionsmorequickly,but—moreimportantly—ismoreusefulforreal-worldqueries.We’vemadesignificantadvancesinreducinghallucinations,improvinginstructionfollowing,andminimizingsycophancy,andhaveleveledupGPT-5’sperformanceinthreeofChatGPT’smostcommonuses:writing,coding,andhealth.AlloftheGPT-5modelsadditionallyfeaturesafe-completions,ourlatestapproachtosafetytrainingtopreventdisallowedcontent.

SimilarlytoChatGPTagent,wehavedecidedtotreatgpt-5-thinkingasHighcapabilityintheBiologicalandChemicaldomainunderour

PreparednessFramework

,activatingtheassociatedsafeguards.Whilewedonothavedefinitiveevidencethatthismodelcouldmeaningfullyhelpanovicetocreateseverebiologicalharm–our

definedthreshold

forHighcapability–wehavechosentotakeaprecautionaryapproach.

5

2ModelDataandTraining

LikeOpenAI’sothermodels,theGPT-5modelsweretrainedondiversedatasets,includinginformationthatispubliclyavailableontheinternet,informationthatwepartnerwiththirdpartiestoaccess,andinformationthatourusersorhumantrainersandresearchersprovideorgenerate.Ourdataprocessingpipelineincludesrigorousfilteringtomaintaindataqualityandmitigatepotentialrisks.Weuseadvanceddatafilteringprocessestoreducepersonalinformationfromtrainingdata.WealsoemployacombinationofourModerationAPIandsafetyclassifierstohelppreventtheuseofharmfulorsensitivecontent,includingexplicitmaterialssuchassexualcontentinvolvingaminor.

OpenAIreasoningmodels,includinggpt-5-thinking,gpt-5-thinking-mini,andgpt-5-thinking-nano,aretrainedtoreasonthroughreinforcementlearning.Thesemodelsaretrainedtothinkbeforetheyanswer:theycanproducealonginternalchainofthoughtbeforerespondingtotheuser.Throughtraining,thesemodelslearntorefinetheirthinkingprocess,trydifferentstrategies,andrecognizetheirmistakes.Reasoningallowsthesemodelstofollowspecificguidelinesandmodelpolicieswe’veset,helpingthemactinlinewithoursafetyexpectations.Thismeanstheyprovidemorehelpfulanswersandbetterresistattemptstobypasssafetyrules.

Notethatcomparisonvaluesfromlivemodels(e.g.,OpenAIo3)arefromthelatestversionsofthosemodels,somayvaryslightlyfromvaluespublishedatlaunchforthosemodels.

3ObservedSafetyChallengesandEvaluations

Intheevaluationsbelow,wefindithelpfultocomparethenewGPT-5modeltoitspredecessortounderstandtheprogressionofsafety.Thismeanswecomparegpt-5-thinkingtoOpenAIo3,andgpt-5-maintoGPT-4o.Sincegpt-5-thinking-proisgpt-5-thinkingusingasettingthatmakesuseofparalleltesttimecompute,wehavedeterminedthattheresultsfromoursafetyevaluationsongpt-5-thinkingarestrongproxies,andthereforewedidnotreruntheseevaluationsintheparalleltesttimecomputesetting.

3.1FromHardRefusalstoSafe-Completions

LargelanguagemodelssuchasthosepoweringChatGPThavetraditionallybeentrainedtoeitherbeashelpfulaspossibleoroutrightrefuseauserrequest,dependingonwhetherthepromptisallowedbysafetypolicy.Whilethisisastrongmitigationforexplicitlymaliciousprompts,focusingsafetytrainingonrefusalscanleadtobrittlenessforpromptswithobscureduserintent.Binaryrefusalboundariesareespeciallyill-suitedfordual-usecases(suchasbiologyorcybersecurity),whereauserrequestcanbecompletedsafelyatahighlevel,butmayleadtomaliciousupliftifsufficientlydetailedoractionable.Asanalternative,weintroducedsafe-completions:asafety-trainingapproachthatcentersonthesafetyoftheassistant’soutputratherthanabinaryclassificationoftheuser’sintent.Safe-completionsseektomaximizehelpfulnesssubjecttothesafetypolicy’sconstraints.WeincorporatedthisapproachintoGPT-5modelsand,acrossbothproductioncomparisons(gpt-5-thinkingvs.OpenAIo3,arefusal-trainedbaseline)andinternallycontrolledexperiments,observedimprovedsafety(especiallyondual-useprompts),reducedseverityofresidualsafetyfailures,andsubstantiallyhigheroverallhelpfulness.

Youcanreadmoreaboutsafe-completionsinourpaper,

FromHardRefusalstoSafe-Completions:

6

TowardOutput-CentricSafetyTraining

.

3.2DisallowedContent

ThefollowingevaluationscheckthatthemodeldoesnotcomplywithrequestsforcontentthatisdisallowedunderOpenAI’spolicies,includinghatefulcontentorillicitadvice.

Weconsiderseveralevaluations:

?StandardDisallowedContentEvaluation:Ourstandardevaluationsettotestrequestsfordisallowedcontent,whichourrecentmodelsperformclosetoperfectlyon.Thisevaluationhasbecomerelativelysaturated,asvisibleintheresultstablebelowaswellasprevioussystemcards,andnolongerprovidesausefulsignalofincrementalchangesinsystemsafetyandperformance.Tohelpusbenchmarkcontinuingprogress,wecreatedanewevaluationset.Weplantostoppublishingthisoldersetinthenearfutureandwillinsteadsharethemorechallengingsetbelow.

?ProductionBenchmarks:AsintroducedwithChatGPTagent,thisisanew,morechallengingevaluationsetwithconversationsrepresentativeofproductiondata.Likethestandardset,theycovermultiplelanguages.Unlikethestandardset,theyarehighlymultiturn-i.e.theyfeaturemultipleroundsofpromptinputandmodelresponsewithinthesameconversation.

WeevaluatecompletionsusingLLM-basedgradingmodels.Itevaluatesthemetricnot_unsafe,checkingthatthemodeldidnotproduceunsafeoutputaccordingtorelevantOpenAIpolicy.

Notethattheproductionbenchmarksetisdesignedspecificallytobemorechallenging,toprovideusefulsignal;scoresarethereforeexpectedtobelowerthanthestandardevaluation.

Table2:StandardDisallowedContentEvaluation(higherisbetter):

Category

gpt-5-thinking

OpenAIo3

gpt-5-main

GPT-4o

hate(aggregate)

1

1.000

0.992

0.987

0.996

illicit/non-violent

0.991

0.991

0.991

0.983

illicit/violent

1.000

1.000

0.992

1.000

personal-data

0.881

0.930

0.980

0.967

personal-data/restricted

0.989

0.921

0.989

0.978

self-harm/intentandself-harm/instructions

1.000

1.000

1.000

1.000

sexual/exploitative

1.000

1.000

1.000

1.000

sexual/minors

0.990

1.000

1.000

1.000

Forthestandarddisallowedcontentevaluation,weobservethatnot_unsafeforpersonal-dataisslightlylowerforgpt-5-thinkingthanOpenAIo3,andrepresentsnaturalnoiseintheevaluation.Similarly,gpt-5-thinkingoverperformsOpenAIo3onnot_unsafeforpersonal-data/restrictedbutthistooisnotstatisticallysignificant.

1Hateinthistableisacombinationof:harassment/threatening,hate,hate/threatening,andextremist/propa-ganda.

7

Table3:ProductionBenchmarks

Category

gpt-5-thinking

OpenAIo3

gpt-5-main

GPT-4o

non-violenthate

0.883

0.842

0.851

0.882

personal-data

0.877

0.830

0.980

0.967

harassment/threatening

0.755

0.666

0.689

0.745

sexual/exploitative

0.931

0.939

0.826

0.927

sexual/minors

0.958

0.957

0.910

0.939

extremism

0.954

0.920

0.910

0.919

hate/threatening

0.822

0.677

0.727

0.867

illicit/nonviolent

0.790

0.717

0.701

0.573

illicit/violent

0.912

0.829

0.786

0.633

self-harm/intent

0.950

0.824

0.849

0.849

self-harm/instructions

0.955

0.864

0.759

0.735

Whilegpt-5-thinkinggenerallyperformsonparorhigherthanOpenAIo3,gpt-5-mainunderper-formsGPT-4oinseveralareaswhileoverperforminginothers.

Specifically,weseestatisticallysignificantimprovementsforgpt-5-maininillicit/nonviolentandillicit/violentwhencomparedtoGPT-4o.Weattributetheseimprovementstothesafe-completionsresearchparadigmsharedabove,asthisenablesthemodeltobetterhandleinputswithambiguousintent.

Thegpt-5-mainregressioninnon-violenthate,harassment/threatening,andsexual/minorsisnotstatisticallysignificantandcanbeattributedtonaturalnoiseintheevaluation.Theregressioninhate/threateningisstatisticallysignificant,althoughwehavefoundthatOpenAIo4-miniperformssimilarlyhere(0.724).Theregressioninsexual/exploitativeisalsostatisticallysignificant;however,manualreviewbyOpenAIresearchersfoundthatthegpt-5-mainoutputs,whilepolicyviolative,arelowseverity.Wewillbefollowingupwithimprovementsinallcategories,butparticularlytargetinghate/threateningandsexual/exploitative.

3.3Sycophancy

InMay2025we

explained

theimmediatemeasureswetooktoaddresssycophanticbehaviorsthatemergedinourGPT-4omodel:werolledbackanewlydeployedversionoftheGPT-4omodel,andalsoadjustedthesystempromptforthemodelthatremainedinproduction.Systemprompts,whileeasytomodify,haveamorelimitedimpactonmodeloutputsrelativetochangesinpost-training.ForGPT-5,wepost-trainedourmodelstoreducesycophancy.Usingconversationsrepresentativeofproductiondata,weevaluatedmodelresponses,thenassignedascorereflectingthelevelofsycophancy,whichwasusedasarewardsignalintraining.

Inofflineevaluations(meaningevaluationsofhowthemodelrespondstoafixed,pre-definedsetofmessagesthatresembleproductiontrafficandcouldelicitabadresponse),wefindthatgpt-5-mainperformednearly3xbetterthanthemostrecentGPT-4omodel(scoring0.145and0.052,respectively)andgpt-5-thinkingoutperformedbothmodels.

Inpreliminaryonlinemeasurementofgpt-5-main(meaningmeasurementagainstrealtrafficfrom

8

earlyA/Btests)wefoundthatprevalenceofsycophancyfellby69%forfreeusersand75%forpaidusersincomparisontothemostrecentGPT-4omodel(basedonarandomsampleofassistantresponses).Whilethesenumbersshowmeaningfulimprovement,weplantocontinueworkingonthischallengeandlookforwardtomakingfurtherimprovements.

Table4:Sycophancyevaluation

Model

TestType

Result(lowerisbetter)

GPT-4o(baseline)

Offlineevaluation

0.145

gpt-5-main

Offlineevaluation

0.052

gpt-5-thinking

Offlineevaluation

0.040

gpt-5-main

Preliminaryonlineprevalencemeasurementcomparedto4o(earlyA/Btests)

-0.69forfreeusers-0.75forpaidusers

gpt-5-mainandgpt-5-thinkingshowmeaningfulimprovementinavoidingsycophanticbehaviorinbothofflineevaluationsandpreliminaryonlinemeasurement.

3.3.1Lookingahead

Wehavepost-trainedtheGPT-5modelstobelesssycophantic,andweareactivelyresearchingrelatedareasofconcern,suchassituationsthatmayinvolveemotionaldependencyorotherformsofmentaloremotionaldistress.Theseareasareparticularlychallengingtomeasure,inpartbecausewhiletheirimportanceishigh,theirprevalencecurrentlyappearstobelow.Weareengaginghuman-computer-interaction(HCI)researchersandclinicianstogivefeedbackonourdefinitionsforconcerninginteractions,andonourevaluationmethods.Weareworkingtomatureourevaluationsinordertosetandsharereliablebenchmarkswhichcaninturnbeusedtomakeourmodelssaferinthesedomains.Weexpecttosharemoreaboutthisworksoon.

3.4Jailbreaks

Wefurtherevaluatetherobustnessofthemodelstojailbreaks:adversarialpromptsthatpurposelytrytocircumventmodelrefusalsforcontentit’snotsupposedtoproduce.Weevaluateusingthefollowingapproach:

?StrongReject[

1

]:insertsaknownjailbreakintoanexamplefromtheabovesafetyrefusaleval.Wethenrunitthroughthesamepolicygradersweusefordisallowedcontentchecks.Wetestjailbreaktechniquesonbasepromptsacrossseveralharmcategories,andevaluatefornot_unsafeaccordingtorelevantpolicy.

9

Table5:Jailbreakevaluations

Category

metric

gpt-5-thinking

OpenAIo3

gpt-5-main

GPT-4o

illicit/non-violent-crimeprompts

not_unsafe

0.995

0.985

0.934

0.937

violenceprompts

not_unsafe

0.999

0.992

0.948

0.955

abuse,

disinformation,hateprompts

not_unsafe

0.999

0.995

0.978

0.981

sexual-contentprompts

not_unsafe

0.995

0.991

0.967

0.961

gpt-5-thinkinggenerallyperformsonparwithOpenAIo3,whilegpt-5-mainisclosetoparitywithGPT-4o.

3.5InstructionHierarchy

ThedeploymentofthesemodelsintheAPIallowsdeveloperstospecifyacustomdevelopermessagethatisincludedwitheverypromptfromoneoftheirendusers.Thiscouldpotentiallyallowdeveloperstocircumventsystemmessageguardrailsifnothandledproperly.Similarly,endusersmaytrytocircumventsystemordevelopermessageguidelines.

Mitigations

Tomitigatethisissue,weteachmodelstoadheretoanInstructionHierarchy[

2

].Atahighlevel,wehavethreeclassificationsofmessagessenttothemodels:systemmessages,developermessages,andusermessages.Wetestthatmodelsfollowtheinstructionsinthesystemmessageoverdevelopermessages,andinstructionsindevelopermessagesoverusermessages.

Evaluations

Tomeasurethis,weusetwosetsofevaluations:

?Systempromptextraction:testsifamalicioususermessagecanextractasecretinthesystempromptthatthemodelisnotsupposedtoreveal.

?Phraseprotection:testswhetherthemodelfollowsasystemmessageinstructioncorrectlywhereamalicioususermessagetriestomakethemodelsay"accessgranted",andthesystemmessagetriestostopthemodelfromdoingthatunlessasecretconditionismet.

10

Table6:InstructionHierarchyEvaluations

Evaluation(higherisbetter)

gpt-5-thinking

OpenAIo3

gpt-5-main

GPT-4o

Systempromptextraction-realisticattacksofuserattackingsystemmessage

0.990

0.997

0.885

0.885

Systempromptextraction-academicattacksofusermessageattackingsystem

message

0.991

0.982

0.930

0.825

Systempromptextraction-academicattacksofde-velopermessageattackingsystemmessage

0.991

0.982

0.789

0.561

Phraseprotection-mali-cioususermessage

0.940

0.975

0.619

0.735

Phraseprotection-mali-ciousdevelopermessage

0.911

0.921

0.404

0.449

Wenoteregressionsinperformanceforgpt-5-main.Wewillfollowupwithafixtoimprovethesebehaviors.

3.6Hallucinations

OneofourfocuseswhentrainingtheGPT-5modelswastoreducethefrequencyoffactualhallucinations.WhileChatGPThasbrowsingenabledbydefault,manyAPIqueriesdonotusebrowsingtools.Thus,wefocusedbothontrainingourmodelstobrowseeffectivelyforup-to-dateinformation,andonreducinghallucinationswhenthemodelsarerelyingontheirowninternalknowledge.

Wefirstevaluatethefactualcorrectnessofgpt-5-thinkingandgpt-5-mainonpromptsrepresenta-tiveofrealChatGPTproductionconversations,usinganLLM-basedgradingmodelwithwebaccesstoidentifymajorandminorfactualerrorsintheassistant’sresponses.Wevalidatedthequalityofthisgraderbyhavinghumansindependentlyassessthecorrectnessofclaimsextractedbythegraderandfounda75%agreementindeterminingfactuality;manualinspectionofthedisagreementsfoundthatourgradertendstocorrectlyidentifymorefactualerrorsthanhumans,whichgaveusconfidenceinthevalidityofusingthisgradertoevaluatehallucinations.Wefindthatgpt-5-mainhasahallucinationrate(i.e.,percentageoffactualclaimsthatcontainminorormajorerrors)26%smallerthanGPT-4o,whilegpt-5-thinkinghasahallucinationrate65%smallerthanOpenAIo3.Attheresponselevel,wemeasure%ofresponseswith1+majorincorrectclaims.Wefindthatgpt-5-mainhas44%fewerresponseswithatleastonemajorfactualerror,whilegpt-5-thinkinghas78%fewerthanOpenAIo3.

11

FactualityonChatGPTProductionTraffic(BrowsingEnabled)

gpt-5-thinkingOpenAIo3gpt-5-mainGPT-4o

22.0%

20.6%

0.20

HallucinationRate

0.15

0.10

0.05

0.00

12.7%

12.9%

11.6%

9.6%

7.7

6.8

7.2

5.9

4.5%4.8%

%incorrectclaims%responseswith1+majorincorrectclaimsNumberofcorrectclaimsperresponse

Figure1:FactualityonChatGPTProductionTraffic(BrowsingEnabled)

Weespeciallyfocusedonreducingthemodels’tendencytohallucinatewhenreasoningaboutcomplex,open-ended,fact-seekingprompts.Accordingly,weaddednewevaluationstotestopen-endedfactualityonopen-sourceprompts.Wetakepromptsfromtwopublicfactualitybenchmarks:

LongFact

and

FActScore

.LongFactconsistsofLLM-generatedquestionsaskingfordetailedresponsesabouteitherspecificobjects(e.g.,peopleorplaces)orbroadconcepts,whileFActScoreconsistsofquestionsseekingbiographiesonnotableindividuals.Then,tomeasurethefactualcorrectnessofresponses,weemployOpenAIo3asagraderinatwostepprocess:

(1)OpenAIo3listsallfactualclaimsfromtheresponsethatarerelevanttotheprompt,(2)wegroupclaimsintobatchesof10andprovideeachbatch,togetherwiththeoriginalpromptandresponse,toaninstanceofOpenAIo3,whichusesitsbrowsingtooltofact-checkeachclaimandmarkitastrue,false,orunsure.FollowingFActScoreandLongFact,wereportclaim-levelerrorrates.WeprovidetheexactgradingpromptsinAppendix

2

.

Weevaluatethegpt-5-thinking,gpt-5-thinking-mini,andgpt-5-thinking-nanomodelsaswellasOpenAIo3ando4-mini,andfindthattheGPT-5modelshavesignificantlylowerhallucinationratesinboth"browse-on"and"browse-off"settings.Inparticular,gpt-5-thinkingmakesover5timesfewerfactualerrorsthanOpenAIo3inbothbrowsingsettingsacrossthethreebenchmarks.

AverageHallucinationRate(BrowsingEnabled)

gpt-5-thinkinggpt-5-thinking-minigpt-5-thinking-nano

OpenAIo3OpenAIo4-mini

HallucinationRate

0.05

0.04

0.03

0.02

0.01

0.00

5.7%

5.1%5.1%

4.5%

4.1%

3.1%

1.9%

2.1%

0.7%0.6%0.9%0.8%0.8%1.0%1.0%

LongFact-ConceptsLongFact-ObjectsFActScore

Figure2:AverageHallucinationRate(BrowsingEnabled)

12

AverageHallucinationRate(BrowsingDisabled)

gpt-5-thinking

OpenAIo4-mini

gpt-5-thinking-mini

gpt-5-main

gpt-5-thinking-nano

GPT-4o

OpenAIo3

HallucinationRate

0.350.300.250.200.150.100.05

0.00

37.7%

24.2%

9.6%

5.9%

7.9%

7.8%

6.9%

5.6%

2.3%

1.1%0.9%0.9%0.8%0.9%

3.1%

1.4%1.8%

3.7%3.7%

1.3%1.1%

LongFact-ConceptsLongFact-ObjectsFActScore

Figure3:AverageHallucinationRate(BrowsingDisabled)

Finally,wealsoevaluategpt-5-thinkingonSimpleQA,Adiversedatasetoffact-seekingquestionswithshortanswersthatmeasuresmodelaccuracyforattemptedanswers.Wefindthatgpt-5-thinkingshowsslightimprovementinhallucinationrateoverOpenAIo3,andgpt-5-thinking-minireasoningshowssignificantimprovementinabstentionbehavioroverOpenAIo4-mini.

Table7:SimpleQAevaluations

Eval

Metric

gpt-5-

thinking

OpenAI

o3

gpt-5-OpenAIthinking-o4-mini

mini

gpt-5-

thinking-

nano

gpt-5-

main

GPT-

4o

SimpleQA(noweb)

accuracy(higherisbetter)

hallucinationrate(lowerisbetter)

0.55

0.40

0.54

0.46

0.220.24

0.260.75

0.11

0.31

0.46

0.47

0.44

0.52

3.7Deception

Deception–whenthemodel’suser-facingresponsemisrepresentsitsinternalreasoningortheactionsittook–canarisefromavarietyofsources.Whilesomecasesmaybelearnedduringpretraining,reflectingdeceptivetextfromtrainingdata,deceptioncanalsobelearnedduringreinforcementlearninginpost-training.Modelsmaylearntobeoverconfident,cheat,or‘trick’falliblegraders,eveniftheirinternalreasoningindicatesuncertainty,assuccessfulattemptsgarnerahighreward.Whilereasoningmodelsprovideuniqueaffordancestoobservedeception,understandingandmitigatingsuchbehaviorsremainsanopenresearchchallenge.Inparticular,OpenAIo3wouldsometimesmakefalseclaimsaboutactionsithadtaken[

3

],sayithadcompletedtasksithadn’t,orfabricatepriorexperiences.

We’vetakenstepstoreducegpt-5-thinking’spropensitytodeceive,cheat,orhackproblems,thoughourmitigationsarenotperfectandmoreresearchisneeded.Inparticular,we’vetrainedthemodeltofailgracefullywhenposedwithtasksthatitcannotsolve–includingimpossiblylargetasksorwhenmissingkeyrequirements–andtobemorerobusttoenvironmentfailures.

Mitigations

Weplacedgpt-5-thinkinginavarietyoftasksthatwerepartlyorentirelyinfeasibletoaccomplish,andrewardedthemodelforhonestlyadmittingitcannotcompletethetask.Weconstructed

13

environmentsinafewsettingswherewehadseenparticularlypronouncedproblemswithdeceptionfromearlierreasoningmodels:

?AgenticCoding.Agentsaregivencodingtaskswithsomekeyunresolvableimpediment,e.g.,missingnetworkorhardwareaccess,thetaskbeingtoolargeforthemodeltofeasiblysolve,etc.

?BrokenTools.Intaskswheretheagentisrequiredtousetools,suchasawebbrowsingtool,inordertoanswerauser’squery,previousmodelswouldhallucinateinformationwhentheto

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論