Benchmarking LLMs on Advanced Mathematical Reasoning

Jonathan Yue
Daniel Klein

Electrical Engineering and Computer Sciences
University of California, Berkeley

Technical Report No. UCB/EECS-2025-121
/Pubs/TechRpts/2025/EECS-2025-121.html

May 16, 2025

Copyright © 2025, by the author(s).
All rights reserved.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission.
Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

A thesis submitted in partial satisfaction of the
requirements for the degree of

Master of Science

in

Electrical Engineering and Computer Sciences

in the

Graduate Division

of the

University of California, Berkeley

Committee in charge:

Professor Dan Klein, Chair
Professor Gireeja Ranade

Spring 2025
Abstract

Benchmarking LLMs on Advanced Mathematical Reasoning

by

Jonathan Yue

Master of Science in Electrical Engineering and Computer Sciences

University of California, Berkeley

Professor Dan Klein, Chair

Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, existing short-answer benchmarks can suffer from limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to convert problem statements into formal languages limits the scope of the problems. A potential reason for this significant gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework to judge model-generated proofs and evaluate its efficacy. Then, we propose a benchmark of 77 PhD-level proof questions drawn from Roman Vershynin's "High-Dimensional Probability: An Introduction with Applications in Data Science", and challenge state-of-the-art LLMs with these questions. We evaluated the LLM-generated solutions using the LLM-as-a-judge framework and found that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.
Contents

Contents i

1 Introduction 1
2 Related works 3
  2.1 LLM for mathematical reasoning 3
  2.2 Benchmarking Mathematical Reasoning in Large Language Models 7
  2.3 LLM-as-a-Judge for Evaluating Mathematical Reasoning 10
3 Data collection 13
  3.1 Ground truth human proof evaluation 13
  3.2 Mathematical text and ground truth solutions 14
4 LLM-as-a-judge for mathematical proofs 16
  4.1 Evaluation 18
5 LLM natural language proof-writing 21
  5.1 Methodology 21
  5.2 Results 23
6 Discussion 25
  6.1 Limitations 26
7 Conclusion and Future Directions 27
Bibliography 29
Acknowledgments

I am deeply grateful to my advisor, Dan Klein, for his support and mentorship over the past two years, and to Professor Gireeja Ranade for being my second reader and offering valuable conversations and feedback. I also wish to thank my mentor, Nicholas Tomlin, for his guidance and thoughtful input throughout this process.

I would like to thank all my classmates in the CompSci 294-271 course, AI in Education, taught by Professor Gireeja Ranade and Professor Narges Norouzi, where I engaged in discussions that inspired the conception of this project. I especially thank my team members in the course, Nora Guo and Sveinung Myhre, with whom I collaborated to obtain the IRB approval and conduct a related study, for helping shape the ideas for this work.

Finally, I extend my heartfelt thanks to my family and friends for their constant encouragement and support, which has been invaluable in every facet of my life.
Chapter 1

Introduction

The advent of Large Language Models (LLMs) has resulted in remarkable advancements in artificial intelligence, with mathematical reasoning emerging as a significant area of interest. LLMs have evolved from performing rudimentary arithmetic to assisting with mathematical discovery (Romera-Paredes et al., 2024), driven by increased model size and sophisticated prompting techniques such as Chain-of-Thought (CoT) prompting (Wei et al., 2023) and Tool-Integrated Reasoning (TIR) (Ahn et al., 2024).
Despite these strides, a considerable gap persists in evaluating the deeper reasoning capabilities of LLMs. Existing benchmarks predominantly focus on short-answer or multiple-choice questions, such as MATH (Hendrycks et al., 2021) and GSM8K (Cobbe et al., 2021). While useful, these benchmarks often emphasize the final numerical output, potentially neglecting the rigor of the intermediate reasoning steps and suffering from issues such as benchmark saturation and data contamination (Hong et al., 2025; Petrov et al., 2025). Furthermore, their limited scope may not sufficiently assess complex conceptual understanding or the ability to construct intricate arguments (K. Huang et al., 2025). The poor performance of LLMs on problems that require rigorous proof generation, as opposed to numerical answers, suggests a potential "reasoning illusion," where success in some tasks might stem from pattern matching or tool assistance rather than genuine mathematical insight (Petrov et al., 2025).
On the other hand, benchmarks centered on formal proofs, such as miniF2F (K. Zheng et al., 2022) and PutnamBench (Tsoukalas et al., 2024), operate within the strict confines of symbolic systems like Lean or Isabelle. While valuable for assessing mechanically verifiable reasoning, the necessity of translating problem statements into these formal languages (autoformalization) can be a challenging task itself, which restricts the breadth of problems that can be addressed (Gulati et al., 2024; J. Zhang et al., 2025). This leaves a significant gap in evaluating the generation of novel, natural language mathematical proofs, which are more akin to human mathematical practice and crucial for interpretability. A primary obstacle to developing benchmarks for such proofs is the inherent difficulty in grading them consistently and at scale, as natural language proofs lack the immediate verifiability of formal proofs and often require expert human evaluation (Frieder et al., 2023).
To address these limitations and bridge the gap in evaluating advanced mathematical reasoning in natural language, this paper makes two primary contributions. First, we propose and evaluate an LLM-as-a-judge framework specifically designed for assessing the correctness and coherence of mathematical proofs. This approach offers a scalable alternative to manual expert evaluation. Second, leveraging this evaluation framework, we introduce a new benchmark comprising 77 PhD-level proof-based questions, designed to rigorously test the advanced reasoning capabilities of LLMs on problems requiring novel and intricate natural language proofs.
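In outline, the judging step reduces to assembling a grading prompt and parsing the judge model's verdict. The sketch below is a minimal illustration of that pattern; the rubric wording, verdict format, and helper names are hypothetical placeholders, not the framework's actual prompt, which is described later in the thesis.

```python
import re

# Hypothetical rubric text; the framework's real prompt may differ.
RUBRIC = (
    "You are grading a mathematical proof. Check every logical step "
    "against the reference solution. End your reply with "
    "'Verdict: correct' or 'Verdict: incorrect'."
)

def build_judge_prompt(problem: str, reference: str, candidate: str) -> str:
    """Assemble the grading prompt to send to the judge model."""
    return (
        f"{RUBRIC}\n\nProblem:\n{problem}\n\n"
        f"Reference solution:\n{reference}\n\nCandidate proof:\n{candidate}"
    )

def parse_verdict(judge_output: str) -> bool:
    """Read off the judge's final verdict; treat a missing verdict as a fail."""
    m = re.search(r"Verdict:\s*(correct|incorrect)", judge_output, re.IGNORECASE)
    return m is not None and m.group(1).lower() == "correct"
```

Defaulting a missing verdict to "incorrect" is a conservative choice: it penalizes judge outputs that never commit to a grade rather than silently passing them.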
We find that, despite recent progress, current leading LLMs are generally still unable to adequately complete these complex proof-based tasks. This underscores the necessity for more challenging benchmarks that probe deeper levels of mathematical understanding and proof generation, thereby providing a more accurate measure of LLM reasoning abilities and guiding future research towards developing more capable and robust AI systems for mathematics.
Chapter 2

Related works

2.1 LLM for mathematical reasoning

The introduction of Large Language Models (LLMs) has marked a significant turn in artificial intelligence. Among their many applications, mathematical reasoning emerges as a key area of exploration and advancement. This section surveys the development of LLMs in tackling mathematical tasks.
Advances and Core Techniques in LLM Mathematical Reasoning

LLMs have rapidly improved at mathematical reasoning, progressing from simple arithmetic to solving more complex problems (Liu et al., 2025). A key driver for this progress is the increase in model size. Larger models follow instructions better and reason more effectively. Research shows that LLMs with over 100 billion parameters can solve difficult math problems with the right prompts (Forootani, 2025).
Several techniques have augmented the improvements due to model scale. A major cornerstone in this area is Chain-of-Thought (CoT) prompting, which guides LLMs to list the steps in their reasoning before giving a final answer. This improves both performance on complex tasks and the interpretability of the reasoning process. Wei et al., 2023 showed that CoT prompting allows LLMs to deconstruct problems they would usually fail at. Researchers have built upon basic CoT with more advanced versions and new types of "thinking" models. Examples include SoftCoT (Y. Xu et al., 2025), Continuous CoT (Cheng & Durme, 2024), and Coconut (Hao et al., 2024). SoftCoT, for instance, utilizes a smaller assistant model to generate "soft thought tokens," which are then projected into the primary LLM's representation space through a trainable module. This fine-tuning approach is parameter-efficient and has demonstrated enhanced performance on mathematical reasoning benchmarks.
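Concretely, the basic CoT recipe amounts to prefixing the question with a worked exemplar and reading the final answer off a fixed marker. The snippet below is a minimal sketch of that convention; the exemplar and the "The answer is" marker are illustrative choices, not prescribed by Wei et al.

```python
def build_cot_prompt(question: str) -> str:
    """Prefix the question with a worked exemplar so the model emits
    intermediate reasoning before its final answer."""
    exemplar = (
        "Q: A box holds 3 red balls and 5 blue balls. How many balls "
        "does it hold in total?\n"
        "A: There are 3 red balls and 5 blue balls, so the total is "
        "3 + 5 = 8. The answer is 8.\n\n"
    )
    return exemplar + f"Q: {question}\nA:"

def extract_final_answer(completion: str) -> str:
    """Recover the short answer after the 'The answer is' marker,
    discarding the reasoning that precedes it."""
    _, _, tail = completion.rpartition("The answer is")
    return tail.strip().rstrip(".")
```

Evaluation then compares only the extracted short answer against the reference, which is exactly the property the short-answer benchmarks discussed below are criticized for.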
The emergence of models explicitly designed for "thinking," such as OpenAI's o1 and DeepSeek-R1, represents a notable progression. These models often take more time to reason during inference and are trained with Reinforcement Learning (RL) to build advanced cognitive skills like self-checking and reflection. DeepSeek-R1, for example, is reported to develop CoT reasoning capabilities autonomously through a pure RL training paradigm (DeepSeek-AI et al., 2025). Multi-round Thinking, where a model uses its previous answers to try again and improve, has also led to better results, as seen in models like QwQ-32B and DeepSeek-R1 on difficult benchmarks such as AIME 2024 (Tian et al., 2025). RL-based reasoning methods are also being used in multimodal LLMs. Vision-R1, for instance, uses Progressive Thinking Suppression Training (PTST) to improve performance in solving multimodal math problems (W. Huang et al., 2025).
To address LLMs' weakness in precise numerical calculation and rigorous symbolic manipulation, Tool-Integrated Reasoning (TIR) has gained prominence (Ahn et al., 2024). TIR lets LLMs hand off subparts of a problem to tools like Python for math or symbolic solvers for algebra. This improves accuracy, especially when high precision or structured math is needed. For example, the winning solution in the AIMO-2 competition used CoT fine-tuning first, then fine-tuned again on a TIR dataset. This helped the model combine natural language reasoning with structured computation (Moshkov et al., 2025). TIR frameworks like TATA (Teaching LLMs According to Their Aptitude) (X. Xu et al., 2025) help models choose between CoT and TIR depending on the problem.
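The core TIR loop can be sketched as: extract the code the model emits, execute it, and feed the captured output back as the tool result. The helper below illustrates only the execution step, under the assumption that the model wraps code in fenced python blocks; a production system would sandbox this step rather than calling exec directly.

```python
import contextlib
import io
import re

def run_tool_calls(model_output: str) -> str:
    """Execute the fenced python blocks a TIR-style model emits and
    return their captured stdout, which would be appended to the
    conversation as the tool result."""
    fence = "`" * 3  # built programmatically to avoid a literal fence here
    pattern = re.compile(fence + r"python\n(.*?)" + fence, re.DOTALL)
    buffer = io.StringIO()
    namespace: dict = {}
    with contextlib.redirect_stdout(buffer):
        for code in pattern.findall(model_output):
            exec(code, namespace)  # real systems isolate this in a sandbox
    return buffer.getvalue().strip()
```

Delegating the arithmetic this way is precisely what lets a model answer with high numerical precision even when its own token-level calculation is unreliable.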
Still, LLMs struggle with highly advanced math problems, especially ones needing full proofs like those in math olympiads. A recent work evaluated state-of-the-art LLMs on 2025 USAMO problems, and the results show significant limits (Petrov et al., 2025). Even a leading model like Gemini-2.5-Pro achieved an average score of only 25%, with other prominent models scoring below 5%. Common failure modes included flawed logical steps, introduction of unjustified assumptions, a lack of creative problem-solving strategies, and a tendency to falsely claim that a problem had been solved.
The observed discrepancy between LLM performance on benchmarks requiring numerical answers and those requiring rigorous proof generation suggests a potential "reasoning illusion." While techniques like CoT and TIR demonstrate improvements on certain types of mathematical problems, the poor performance on proof-based tasks suggests that these improvements might not always translate to genuine, deep mathematical understanding. LLMs may be leveraging pattern matching and tool-assisted computation to succeed in numerical answer-oriented tasks, without necessarily developing deep logical deductive capabilities. This illustrates the critical need for benchmarks that specifically evaluate these deeper reasoning and proof-generation abilities.
Formal vs. Informal Mathematical Reasoning by LLMs

Mathematical reasoning by LLMs can be broadly categorized into two domains: formal mathematical reasoning, which operates under the rigorous syntax of symbolic systems and proof assistants, and informal mathematical reasoning, which expresses mathematics in natural language.
Formal Mathematical Reasoning

This sub-field is characterized by its emphasis on mechanical verifiability and logical rigor.

LLMs for Formal Proof Generation (Lean, Isabelle, etc.): Much research is dedicated to training LLMs to generate proofs in formal languages such as Lean 4 (J. Zhang et al., 2025) and Isabelle (X. Zhao et al., 2024). The goal of this line of research is to develop systems that can autonomously produce machine-verifiable mathematical proofs.

Several notable models and systems have demonstrated progress. AlphaGeometry made headlines by solving Olympiad-level geometry problems. It combined a language model, which suggests potentially useful geometric constructions, with a symbolic deduction engine that formally verifies these steps. A key component of its development was the generation of 100 million synthetic data examples for training (Trinh et al., 2024). An improved version, AlphaGeometry2, extended its formal language to handle a wider range of geometric problems and leveraged the Gemini architecture. AlphaGeometry2 reportedly surpassed the average performance of human gold medalists on a benchmark of IMO geometry problems (Chervonyi et al., 2025).
The Self-play Theorem Prover (STP) addresses the critical issue of data scarcity in formal mathematics by employing a dual-agent system: a conjecturer that proposes new theorems and a prover that attempts to prove them. This self-play loop allows the system to iteratively generate its own training data, which results in significant performance improvements on various reasoning benchmarks (Dong & Ma, 2025). Goedel-Prover is an open-source LLM designed for formal proof generation in Lean 4. It achieved state-of-the-art results by first formalizing a large dataset of natural language math problems into Lean 4 statements and then employing an expert iteration strategy to train the prover (Lin et al., 2025). Kimina-Prover Preview developed a reasoning-driven exploration paradigm, utilizing large-scale RL based on Qwen2.5-72B to generate proofs in Lean 4. This approach has also set new SOTA performance on the miniF2F benchmark (H. Wang et al., 2025).

Despite these advancements, a key challenge remains the scarcity of large-scale, high-quality datasets of formalized mathematical statements and proofs. The rigorous syntax of formal languages often makes proof generation in these systems more difficult for LLMs compared to natural language (J. Zhang et al., 2025).
Autoformalization: This critical sub-area focuses on the automatic translation of informal mathematical language into formal languages that can be processed by proof assistants. LLMs like GPT-4 have considerably advanced autoformalization capabilities through in-context learning. Techniques such as back-translation, where existing formal statements are automatically informalized into natural language to produce synthetic pairs of formal and informal mathematical statements, have been explored to augment training data (Yang et al., 2024).

However, autoformalization remains a difficult task, especially for complex or abstract mathematical concepts (Gulati et al., 2024). LLMs can struggle with selecting the correct preambles or library imports from vast formal mathematics libraries and may even generate references to non-existent preambles. Evaluating autoformalization faithfulness, i.e., whether the formal statement accurately captures the semantics of the original informal input, remains an area of ongoing research.
Informal (Natural Language) Mathematical Proofs

This domain concerns the generation and comprehension of mathematical proofs in the form most familiar to humans: natural language.

Current Landscape: While CoT techniques enable LLMs to generate natural language step-by-step reasoning, there has been relatively little research into the generation of rigorous and verifiable natural language mathematical proofs for complex problems. Much of the existing work on mathematical reasoning in LLMs focuses on solving problems that yield numerical or short-form answers.

The generation and evaluation of natural language proofs by LLMs present a unique set of challenges:

• Verifiability: A fundamental difficulty lies in the automated verification of natural language proofs. Unlike formal proofs, which can be mechanically checked by proof assistants, natural language proofs typically require human expert evaluation or highly sophisticated AI-driven verifiers. Even minor errors in logic or calculation can invalidate an entire proof, and these can be difficult for current systems to reliably detect in a natural language context.

• Ambiguity: Natural language is inherently ambiguous and often relies on implicit contextual knowledge and common sense. For instance, whether omitting certain steps in a mathematical argument is acceptable depends on the assumed level of familiarity of both the author and the audience. LLMs might not fully grasp these ambiguities and maintain the necessary level of precision throughout a proof. Another concern is that an LLM may inadvertently exploit these ambiguities and pretend as if it had completed a proof.

• Data Scarcity for Complex Proofs: High-quality, large-scale datasets of complex natural language proofs, particularly those with detailed step-by-step annotations or explanations of reasoning, are scarce. This is especially true for niche or advanced mathematical topics.
The comparatively smaller volume of research focused on generating and verifying natural language mathematical proofs by LLMs, as opposed to formal proofs or problem-solving with numerical answers, may be attributed to several factors. Formal systems and proof assistants make formal proof generation an attractive target for AI research, as they provide a clear, objective measure of correctness. This reliable feedback loop for model training and evaluation stands in contrast to the inherent ambiguity of natural language. Additionally, formal mathematics libraries (e.g., Mathlib for Lean) offer a growing, structured, and verified corpus of mathematical knowledge. The field of Automated Theorem Proving (ATP) also has a long-standing tradition in AI, historically focusing on formal logic.
Despite these challenges, we believe that the generation and evaluation of complex informal mathematical proofs is vitally important. Natural language proofs are often more interpretable, which makes them more desirable in many settings. For instance, in an education setting, capabilities in complex mathematical reasoning may allow for more efficient proof grading or student hint generation. The space of mathematical reasoning can also act as a testbed for AI reasoning in less-defined real-world environments, where formalization would be nearly impossible and where interpretability would be ever more important.
2.2 Benchmarking Mathematical Reasoning in Large Language Models

The evaluation of mathematical reasoning capabilities in LLMs relies heavily on benchmarks. This section reviews existing benchmarks, categorizing them by their focus and problem types, and critically examines their limitations.
Short-Answer Benchmarks

These benchmarks typically assess LLMs on mathematical problems where the expected output is a final numerical answer or a concise textual response. Prominent examples include:

• MathQA (Amini et al., 2019): Derived from the AQuA dataset, MathQA presents math word problems with multiple-choice answers. It was designed with an emphasis on interpretability by mapping problems to sequences of predefined mathematical operations (Amini et al., 2019).

• MATH (Hendrycks et al., 2021b): This dataset is comprised of challenging problems from mathematics competitions such as the American Mathematics Competitions (AMC 10/12) and the American Invitational Mathematics Examination (AIME). While solutions often require multiple reasoning steps, evaluation typically focuses on the correctness of the final numerical answer (Hendrycks et al., 2021).

• GSM8K (Cobbe et al., 2021): Consisting of grade school mathematics word problems, GSM8K is designed to test multi-step arithmetic reasoning. Models are expected to produce a numerical answer derived through a sequence of calculations (Cobbe et al., 2021).
Despite their widespread use, these short-answer benchmarks face several criticisms:

• Emphasis on Final Answers: A primary limitation is their predominant focus on the correctness of the final answer, often neglecting the intermediate reasoning steps. This evaluation approach can inadvertently reward models that arrive at the correct solution through flawed reasoning processes (Petrov et al., 2025).

• Benchmark Saturation: Many of these benchmarks are rapidly approaching saturation, with state-of-the-art LLMs achieving very high accuracy scores (e.g., over 97% on GSM8K as reported in Budagam et al., 2024). This saturation diminishes their efficacy in differentiating the capabilities of advanced models.

• Data Contamination and Overfitting: A significant concern is the potential for benchmark data to have been included in the vast training corpora of LLMs. This data contamination can lead to inflated performance metrics that reflect memorization rather than true reasoning ability. Studies that have created perturbed versions of these benchmarks, such as GSM1k (a contamination-free version of GSM8K) (H. Zhang et al., 2024) and RV-Bench (which introduces random variables into MATH problems) (Hong et al., 2025), have often observed notable performance drops, suggesting that models may be overfitting to the original benchmark distributions.

• Limited Scope for Complex Reasoning: These benchmarks, often centered on arithmetic or basic algebraic manipulation, may not adequately test deeper conceptual understanding, abstract reasoning, or the ability to construct complex proofs (K. Huang et al., 2025). MathQA, for example, explicitly omits problems involving higher-order polynomials or entirely non-numeric solutions.

• Sensitivity to Input Perturbations: LLM performance on these benchmarks can be surprisingly fragile, exhibiting significant degradation in response to minor alterations in problem phrasing, numerical values, or the introduction of irrelevant information. This sensitivity suggests a lack of robust understanding of underlying mathematical principles. For instance, MATH-P-Hard introduces hard perturbations to MATH problems so that the original solution steps don't apply, resulting in significant performance drops. The researchers found that the models blindly apply learned problem-solving tactics even towards unsuitable problems (K. Huang et al., 2025).

• Lack of Diversity in Problem Types and Assessed Reasoning Skills: Existing benchmarks may not encompass a sufficiently broad range of mathematical topics or the diverse cognitive skills required for advanced mathematical thought. The MaTT benchmark, for example, found that LLM performance can vary significantly even across closely related subtopics within the same general mathematical area (Davoodi et al., 2025).
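The perturbation idea behind GSM1k and RV-Bench can be illustrated by re-instantiating a problem template with fresh numeric values and recomputing the ground truth, so that a memorized answer no longer helps. The template below is a toy example, not drawn from either benchmark.

```python
import random

def perturb_problem(seed: int) -> tuple[str, int]:
    """Instantiate a templated word problem with randomized values,
    in the spirit of RV-Bench's random variables; the ground truth
    is recomputed for each instance rather than looked up."""
    rng = random.Random(seed)  # seeded so each instance is reproducible
    a, b, price = rng.randint(2, 9), rng.randint(2, 9), rng.randint(1, 5)
    question = (
        f"A shop sells {a} boxes of pens on Monday and {b} boxes on "
        f"Tuesday. Each box costs ${price}. How much money did it make?"
    )
    answer = (a + b) * price
    return question, answer
```

A model that has merely memorized the original benchmark item will fail on such re-instantiations, which is exactly the performance drop these studies report.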
Olympiad-Level and Proof-Focused Benchmarks

To address the limitations of short-answer benchmarks and to probe more advanced mathematical reasoning, several benchmarks focusing on Olympiad-level problems and proof generation have been developed.

• miniF2F (K. Zheng et al., 2022): This benchmark provides a collection of 488 formal problem statements derived from Olympiad-level mathematics (AIME, AMC, IMO) and undergraduate courses, formalized in multiple interactive theorem proving systems like Lean, Isabelle, and HOL Light. It primarily covers algebra, number theory, and inequalities, with formalizations done manually.

• PutnamBench (Tsoukalas et al., 2024): Sourced from the challenging William Lowell Putnam Mathematical Competition, PutnamBench features 1692 hand-constructed formalizations of 640 theorems in Lean 4, Isabelle, and Coq. Current models have demonstrated very limited success on this benchmark, solving only a handful of the problems.

• Omni-MATH (Gao et al., 2025): This benchmark aims to be a universal Olympiad-level mathematics evaluation suite, comprising 4,428 problems spanning 33 sub-domains and 10 difficulty levels. It utilizes GPT-4o as a judge (Omni-Judge) for evaluating answers, which are typically numerical or SymPy objects.

• FrontierMath (Glazer et al., 2024): This benchmark introduces hundreds of original, exceptionally challenging mathematics problems at the advanced undergraduate, graduate, and research levels, crafted and vetted by expert mathematicians.