
Bridging chemistry and artificial intelligence by a reaction description language

Article in Nature Machine Intelligence · May 2025. DOI: 10.1038/s42256-025-01032-8


Nature Machine Intelligence
Article

Received: 15 May 2024
Accepted: 4 April 2025
Published online: xxxxxxxx

Jiacheng Xiong1,2,6, Wei Zhang1,2,6, Yinquan Wang1,3, Jiatao Huang4, Yuqi Shi1,2, Mingyan Xu1,2, Manjia Li1, Zunyun Fu1, Xiangtai Kong1,2, Yitian Wang1,2, Zhaoping Xiong5 & Mingyue Zheng1,2

With the fast-paced development of artificial intelligence, large language models are increasingly used to tackle various scientific challenges. A critical step in this process is converting domain-specific data into a sequence of tokens for language modelling. In chemistry, molecules are often represented by molecular linear notations, and chemical reactions are depicted as sequence pairs of reactants and products. However, this approach does not capture atomic and bond changes during reactions. Here, we present ReactSeq, a reaction description language that defines molecular editing operations for step-by-step chemical transformation. Based on ReactSeq, language models for retrosynthesis prediction may consistently excel in all benchmark tests, and demonstrate promising emergent abilities in the human-in-the-loop and explainable artificial intelligence. Moreover, ReactSeq has allowed us to obtain universal and reliable representations of chemical reactions, which enable navigation of the reaction space and aid in the recommendation of experimental procedures and prediction of reaction yields. We foresee that ReactSeq can serve as a bridge to narrow the gap between chemistry and artificial intelligence.

Artificial intelligence technologies, represented by large language models (LMs), have achieved unprecedented breakthroughs in natural language processing, influencing the model of scientific research (refs. 1,2). In the life sciences domain, LMs are now used to mine hidden information from protein and gene sequences, achieving remarkable results. Notable examples include ESM, which interprets the protein functions and structures from their sequences (refs. 3,4), and Geneformer, which predicts gene functions and interactions (ref. 5). In chemistry and pharmaceuticals, the important concept of chemical LMs (CLMs), which handle chemical molecules and reactions, has also emerged (refs. 6-8).

Unlike natural languages, proteins and genes, chemical molecules lack inherent sequential representations. CLMs capitalize on chemist-defined molecular linear notations to learn and generate molecular structures. The most commonly used molecular linear notation is the simplified molecular input line entry system (SMILES) (ref. 9). Recently, to enhance the performance of CLMs in specific tasks, some new molecular linear notations were designed. For instance, SELFIES was developed to help CLMs produce valid molecular structures (ref. 10), and PSMILES was introduced to facilitate the learning of polymer representations (ref. 11).

However, these molecular linear notations are all designed to describe the static structures of chemical molecules. They cannot explicitly describe the crucial aspect of chemistry, namely the process of atom and bond changes in molecules during a chemical reaction (ref. 12). This significantly restricts the application of LMs in chemical reaction prediction and representation. Current LMs for chemical reaction prediction, involving forward and retrosynthesis prediction, typically directly translate the linear notations of products and reactants into each other, which has been consistently criticized for lacking interpretability and interactivity (Fig. 1, top) (refs. 13-15). Recently, Wang et al. (ref. 16) and Thakkar et al. (ref. 17) decomposed the transformation from product to reactant into two sequential steps: first, translating the product into synthons using a transformer, and then translating these synthons into reactants with another transformer. While this two-stage design improves the model's interpretability and interactivity, it also increases the model's complexity and compromises its end-to-end properties. Additionally, due to the limitations of SMILES syntax, these methods can only indicate the atoms involved in the reaction without detailing their specific changes. Furthermore, while pretrained LMs excel in representation learning of various sequence data (refs. 18,19), similar advancements for chemical reactions are notably lacking. Existing CLMs can learn atom transformation from unmapped reaction data (ref. 20). However, generating meaningful vector representations for this transformation process remains challenging. Current self-supervised reaction representations still struggle to effectively capture the similarities between different reactions (ref. 21).

Fig. 1 | Overview of this work. A comparison between the previous LM for reaction prediction based on SMILES and our proposed method based on ReactSeq.
Top, previous language model for reaction prediction (SMILES to SMILES): the product C1(C)=C2CCCCC2=NN1C1=CC(OC(C)C)=C(Cl)C=C1F is translated by the LM into the reactants CC(C1CCCCC1=O)=O.CC(OC1=CC(NN)=C(C=C1Cl)F)C. Listed drawbacks: lack of interactivity, poor in reaction representation, lack of interpretability.
Bottom, our proposed method (SMILES to ReactSeq): the same product is translated by the LM into the ReactSeq C1(C)_C2CCCCC2!NN!1C1=CC(OC(C)C)=C(Cl)C=C1F<[O:1]><><[O:1]>, with the editing operations (Break, Attach, Reduce) marked on the atom-numbered structure. Other panel labels: editing operation, molecule, prompt encoding, embedding, decoding; prompt shown: C1(C)=C2CCCCC2=NN1!C1=CC(OC(C)C)=C(Cl)C=C1F. Listed advantages: human-in-the-loop, better reaction representation, explainable reasoning.

1 Drug Discovery and Design Center, State Key Laboratory of Drug Research, Shanghai Institute of Materia Medica, Chinese Academy of Sciences, Shanghai, China. 2 University of Chinese Academy of Sciences, Beijing, China. 3 Department of Medicinal Chemistry, School of Pharmacy, Fudan University, Shanghai, China. 4 School of Physical Science and Technology, ShanghaiTech University, Shanghai, China. 5 ProtonUnfold Technology Co. Ltd, Suzhou, China. 6 These authors contributed equally: Jiacheng Xiong, Wei Zhang. e-mail: myzheng@

Therefore, to advance the application of LMs in chemistry, developing new languages for describing chemical reactions is necessary. Ideally, this language should make the reaction predictions more accurate and interpretable, enabling clarification of the transformation process of atoms and bonds. The prediction of the transformation process should be controllable, allowing chemists to guide LMs with their knowledge (ref. 22). Moreover, this language should enable LMs to generate better reaction representations for diverse downstream tasks.

In this work, we introduce a reaction description language named ReactSeq, designed to meet the aforementioned objectives (Fig. 1, bottom). Inspired by the retrosynthesis process, ReactSeq defines both the product structure and the molecular editing operations (MEOs) required to transform it back into reactant molecules. These MEOs include the breaking and changing of chemical bonds, alterations in atomic charges, and the attachment of leaving groups (LGs), among others (Fig. 2a). In a ReactSeq-based retrosynthesis LM, the reactant is not generated token-by-token from scratch. Instead, it is transformed from the product molecule through these MEOs. This ensures precise atom mapping between the predicted reactants and the products, enhancing the model's interpretability. Using ReactSeq, a vanilla transformer can achieve state-of-the-art performance in retrosynthesis prediction. Moreover, ReactSeq features explicit tokens denoting MEOs, enabling the encoding of human instructions. Our results show that human experts' prompts can significantly enhance the model's performance and even guide it in exploring new reactions. In addition, the embeddings of those MEO tokens provide a universal and reliable reaction representation. These self-supervised representations can naturally distinguish between reaction types and evaluate their similarity, facilitating similar reaction retrieval, experimental procedure recommendation and reaction yield prediction.

Overview of ReactSeq

ReactSeq consists of two parts: a header and a tail (Fig. 2a). The header includes the structural details of a target molecule and information on changes to its atoms and bonds, describing how to transform it into the corresponding synthons. The tail includes the structures of the LGs and their connection positions with the synthons, describing how to complete the synthons into reactants. In standard SMILES, tokens for double and triple bonds are visible, while tokens for single bonds are hidden. However, the hidden tokens can be specified using SMILES with explicit bonds (Fig. 2b). By replacing these bond tokens in SMILES with MEO tokens (for example, using an exclamation mark '!' to represent breaking a bond), we obtain the header of ReactSeq that records the changes and breaks in chemical bonds. Some target molecules in retrosynthesis do not involve breaking or changing of bonds between heavy atoms. Instead, they are directly connected to LGs. In these cases, we initially convert the atom token to the explicit hydrogen mode, such as changing O to [OH], and then add a corresponding MEO token (~) to it. Furthermore, changes in chirality, charge and cis–trans isomerism are also defined in ReactSeq.

Fig. 2 | Illustration of ReactSeq. a, Several representative examples of ReactSeq describing MEOs in retrosynthesis, including bond breaking, bond changing, connecting to LGs and chirality change. b, Visualization of hidden tokens for single bonds in SMILES.
Panel a examples (target SMILES, then ReactSeq as header followed by tail):
  Break bond and connect to leaving group: C1(Cl)=NC=C(CBr)C(Cl)=C1 gives C1(Cl)=NC=C(C!Br)C(Cl)=C1<><C1(=O)CCC(=O)[N:1]1>
  Change bond (change to double bond): C1=C2CCC(O)C2=CC=C1F gives C1=C2CCC(;O)C2=CC=C1F
  Directly connect to leaving group (attachment point): C1=CC(OCC(=O)O)=CN=C1 gives C1=CC(OCC(=O)[~OH])=CN=C1<[CH3:1]>
  Change bond and change chirality (change to single bond, change to S-configuration): CC(=O)C1=CC=CC=C1 gives C[sC](_O)C1=CC=CC=C1<><>
Panel b: SMILES C1(Cl)=NC=C(CBr)C(Cl)=C1; SMILES with explicit bonds C1(-Cl)=N-C=C(-C-Br)-C(-Cl)=C-1 (atoms numbered 1 to 10).
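The header construction described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the simplified tokenizer regex, the function name `mark_bond_break` and the convention of addressing bonds by their order of appearance are all introduced here for the example. It takes the explicit-bond SMILES of Fig. 2b, replaces one bond token with the break token '!', and hides the remaining single-bond tokens again.

```python
import re

# Simplified SMILES tokenizer (assumption: it only needs to cover the atoms,
# bonds, branches and ring digits that occur in the Fig. 2 example).
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]|[-=#]|[()]|\d")
BOND_TOKENS = {"-", "=", "#"}


def tokenize(smiles: str) -> list[str]:
    """Split a SMILES string into tokens, rejecting unsupported characters."""
    tokens = TOKEN_RE.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"unsupported characters in {smiles!r}")
    return tokens


def mark_bond_break(explicit_smiles: str, bond_index: int) -> str:
    """Replace the bond_index-th bond token (0-based, in order of appearance)
    with '!' and drop the remaining explicit single-bond tokens."""
    tokens = tokenize(explicit_smiles)
    bond_positions = [i for i, tok in enumerate(tokens) if tok in BOND_TOKENS]
    tokens[bond_positions[bond_index]] = "!"
    return "".join(tok for tok in tokens if tok != "-")


explicit = "C1(-Cl)=N-C=C(-C-Br)-C(-Cl)=C-1"  # SMILES with explicit bonds (Fig. 2b)
header = mark_bond_break(explicit, bond_index=5)  # bond 5 is the C-Br bond
print(header)
```

With the C-Br bond selected, the result coincides with the header of the first example in Fig. 2a. Other MEO tokens could be handled the same way by substituting a different replacement token.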

To obtain the tail of ReactSeq, first, the atoms in the target molecules that could connect to LGs are identified, known as attachment points. These include atoms directly connected to LGs or involved in bond breaking or reduction. The LGs of each attachment point are enclosed in angle brackets and sorted based on the atomic indexes of their connected attachment points. Following these steps, a standard header-to-tail ReactSeq is obtained, maintaining high alignment with the SMILES of the target molecule. A related work to ReactSeq is the condensed graph of reaction (CGR), which represents chemical reactions as pseudo-molecules and can generate linear notations for these pseudo-molecules (ref. 23). CGR defines operations for atomic and bond changes but fails to provide operations for completing LGs, which restricts its capacity to depict unbalanced reactions. Additionally, the linear representation of CGR cannot ensure alignment with either the reactant or product SMILES. Further details about ReactSeq are provided in the Methods section.
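As a small illustration of the header-to-tail layout, the sketch below splits a ReactSeq string into its header and the list of leaving groups in its tail. The function name `split_reactseq` is hypothetical, and the code assumes that the first '<' character marks the start of the tail and that angle brackets are never nested, which holds for the examples shown in Fig. 2a.

```python
import re


def split_reactseq(reactseq: str) -> tuple[str, list[str]]:
    """Split a ReactSeq string into (header, leaving_groups).

    leaving_groups has one entry per angle-bracket pair, in the order they
    appear in the tail; an empty pair '<>' yields an empty string.
    """
    start = reactseq.find("<")
    if start == -1:  # no tail present in the string
        return reactseq, []
    header, tail = reactseq[:start], reactseq[start:]
    leaving_groups = re.findall(r"<([^<>]*)>", tail)
    return header, leaving_groups


header, lgs = split_reactseq("C1(Cl)=NC=C(C!Br)C(Cl)=C1<><C1(=O)CCC(=O)[N:1]1>")
print(header)  # the header, still carrying the '!' bond-break token
print(lgs)     # one entry per attachment point, sorted as in the tail
```

Because the tail is sorted by the atomic index of the attachment points, the position of each entry in `lgs` is enough to pair a leaving group with its attachment point once those points have been located in the header.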

ReactSeq improves retrosynthesis prediction performance

To demonstrate the application of ReactSeq, we first used it for retrosynthesis prediction using a vanilla transformer without any additional modifications. Table 1 presents a comprehensive comparison of our proposed methods and other methods on the USPTO-50k dataset. Our model outperformed all others, regardless of whether reaction types were given.

Our model uses a two-stage retrosynthesis reasoning strategy that first identifies the reaction centre and then completes the synthons, corresponding to the header and tail components of ReactSeq, respectively. While many graph edit-based methods (for example, G2Gs (ref. 24), RetroXpert (ref. 25), GraphRetro (ref. 26)) and the sequence-based method RetroPrime also use this strategy, they often underperform in top-k (k ≥ 3) accuracy. This might stem from their use of different models for each stage's tasks, which disrupts the continuity of the information flow between the two tasks, leading to the accumulation of errors. In contrast, ReactSeq's header-to-tail structure consolidates the descriptions of two retrosynthesis stages into one sequence, enabling sequential processing of these two tasks with an end-to-end model, thus achieving state-of-the-art top-k accuracy. Furthermore, current graph edit-based methods such as Graph2Edits (ref. 27), GraphRetro (ref. 26) and RetroExplainer (ref. 28) formulate synthon completion as a classification problem, selecting LGs from a predefined vocabulary. This limits their ability to generate new LGs and explore new reactions. Conversely, our model generates SMILES of LGs token by token, allowing for the generation of new LGs absent from the training set, offering greater flexibility to discover new chemical transformations. However, this also introduces the risk of generating chemically infeasible LGs, as demonstrated in Supplementary Fig. 1. While three predicted LGs appear chemically plausible, with analogous reactions having been reported, the last one, the difluoromethanesulfonic acid group, has not been observed in existing reactions. This highlights the need for more careful evaluation of new predictions before execution.
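One simple way to act on this caution is to flag every generated LG that never occurred in the training data, so that it can be reviewed before use. The sketch below is an assumption-laden illustration rather than part of the published method: the function name, the toy vocabulary and the toy predictions are invented for the example, and the comparison is an exact string match, which presumes that all LG strings were written in the same canonical form.

```python
def flag_novel_leaving_groups(predicted, training_vocabulary):
    """Return the predicted LG strings that are absent from the training set.

    Assumption: LGs are compared as exact strings, so both inputs must use
    the same canonical notation.
    """
    known = set(training_vocabulary)
    return [lg for lg in predicted if lg not in known]


# Toy data, invented for illustration only.
training_vocabulary = ["[Cl:1]", "[OH:1]", "[Br:1]"]
predicted = ["[Cl:1]", "[CH3:1]", "[Br:1]"]

novel = flag_novel_leaving_groups(predicted, training_vocabulary)
print(novel)  # candidates that need expert or literature review
```

A vocabulary-based classifier can never produce an entry in this list, whereas a token-by-token generator can, which is exactly the trade-off between flexibility and feasibility described above.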


Table 1 | Top-k accuracy of our proposed method and other models on the USPTO-50k dataset

Top-k accuracy (%)
                            Reaction class unknown        Reaction class known
Model                       k=1    3      5      10       k=1    3      5      10

Template-based methods
Retrosim (ref. 52)          37.3   54.7   63.3   74.1     52.9   73.8   81.2   88.1
Neuralsym (ref. 53)         44.4   65.3   72.4   78.9     55.3   76.0   81.4   85.1
GLN (ref. 54)               52.5   69.0   75.6   83.7     64.2   79.1   85.2   90.0
LocalRetro (ref. 44)        54.2   76.8   80.4   90.3     -      -      -      -

Graph edit-based methods
G2Gs (ref. 24)              48.9   67.6   72.5   75.5     61.0   81.3   86.0   88.7
RetroXpert (ref. 25)        50.4   61.1   62.3   63.4     62.1   75.8   78.5   80.9
MEGAN (ref. 55)             48.1   70.7   78.4   86.1     60.7   82.0   87.5   91.6
GraphRetro (ref. 26)        53.7   68.3   72.2   75.5     63.9   81.5   85.2   88.1
G2Retro (ref. 56)           53.9   74.6   80.7   86.6     63.1   84.2   88.5   91.7
Graph2Edits (ref. 27)       55.1   77.3   83.4   89.4     67.1   87.5   91.5   93.8
RetroExplainer (ref. 28)    57.7   79.2   84.8   91.4     66.8   88.0   92.5   95.8
NAG2G (ref. 57)             55.1   76.9   83.4   89.9     67.2   86.4   90.5   93.8
RetroCaptioner (ref. 58)    54.3   76.3   82.6   88.1     67.2   86.0   90.3   93.4

Sequence-based methods
SCROP (ref. 59)             43.7   60.0   65.2   68.7     59.0   74.8   78.1   81.1
Aug. Transformer (ref. 45)  53.2   -      80.5   85.2     -      -      -      -
Dual-TF (ref. 60)           53.6   70.7   74.6   77.0     65.7   81.9   84.7   85.9
BARTSmiles (ref. 50)        55.6   -      74.2   80.9     -      -      -      -
RetroPrime (ref. 16)        51.4   70.8   74.0   76.1     64.8   81.6   85.0   86.9
Chemformer (ref. 47)        54.3   -      62.3   63.0     -      -      -      -
R-SMILES (ref. 46)          56.3   79.2   86.2   91.0     -      -      -      -
ReactSeq                    58.9   80.5   86.4   91.4     68.5   89.2   93.1   95.9

Note: the hyphen represents that the corresponding results are not reported and bold represents the best result. The results for LocalRetro come from the author's most recent update on GitHub.
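For readers unfamiliar with the metric in Table 1, top-k accuracy can be computed as sketched below. This is a generic illustration, not the evaluation script used in the paper: the function name and the toy data are invented, and a prediction is counted as correct only on an exact string match, which assumes that predicted and reference reactants share one canonical form.

```python
def top_k_accuracy(ranked_predictions, ground_truth, k):
    """Percentage of test cases whose reference answer appears among the
    first k ranked predictions (exact string match)."""
    hits = sum(
        truth in predictions[:k]
        for predictions, truth in zip(ranked_predictions, ground_truth)
    )
    return 100.0 * hits / len(ground_truth)


# Toy data: four test cases with three ranked candidates each.
ranked_predictions = [
    ["A", "B", "C"],
    ["D", "E", "F"],
    ["G", "H", "I"],
    ["J", "K", "L"],
]
ground_truth = ["A", "E", "X", "L"]

for k in (1, 3):
    print(f"top-{k} accuracy: {top_k_accuracy(ranked_predictions, ground_truth, k):.1f}%")
```

In this toy set only the first case is correct at rank 1, while the second and fourth are recovered within the top three, so the accuracy rises with k in the same way as across the columns of Table 1.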

We further analysed our model's performance across various reaction types. For rare reaction types such as cyclization and functional group transformations, the model maintained strong performance, achieving top-ten accuracies of 84.6 and 91.3%, respectively (Supplementary Fig. 2). However, the accuracy significantly decreased to 65.6% for reactions involving stereochemical changes (Supplementary Table 1). This discrepancy likely stems from knowledge transferability. Cyclization and functional group transformation reactions, although underrepresented in our dataset, involve MEOs such as bond breaking and attaching LGs, which are common in other reactions. This allows the model to effectively transfer knowledge from more frequent reactions to predict these rarer types. In contrast, stereochemical changes involve unique rules that cannot be inferred from reactions without such transformations.

We also evaluated our method on the larger USPTO-MIT dataset, where it demonstrated superior performance with accuracy rates of 60.5, 78.5, 83.3 and 87.6% for the top one, three, five and ten predictions, respectively (Supplementary Table 2). These results highlight our method's applicability to large-scale datasets. Moreover, to validate the generalizability of our model, we conducted an external evaluation using ELN, a real-world reaction dataset (ref. 29). Our model achieved state-of-the-art performance on this dataset (Supplementary Table 3), emphasizing its robust generalization capabilities. Ablation studies about our ReactSeq language and training strategy are provided in the Supplementary Information C.1.

ReactSeq enables interpretable retrosynthesis prediction

Conventional sequence-based retrosynthesis methods directly convert product SMILES into reactant SMILES, failing to describe the specific transformation process from product to reactants. ReactSeq addresses this limitation by dividing the retrosynthesis prediction into two phases: identifying the reaction centre and completing the synthons. This two-stage molecular editing approach simulates the retrosynthetic analysis workflow, aligning better with human expert intuition compared to the approach inspired by specific reaction mechanisms and offering greater generality (ref. 28).

Supplementary Table 4 showcases the performance of our model and other methods in the two stages of retrosynthesis. Our model achieved 73.1% top-one accuracy in identifying reaction centres and 77.6% in synthon completion, significantly surpassing previous methods. Among these two-stage retrosynthesis methods, RetroFormer is also sequence-based. However, limited by SMILES syntax, the method identifies only atoms involved in bond breaking when identifying the reaction centre, failing to capture the changes of other atoms and bonds. During the synthon completion stage, RetroFormer directly translates synthon SMILES into reactant SMILES, not clarifying the process of attaching LGs. Some sequence-based methods use the attention weights to indicate reaction centres and perform atomic mapping. However, the explanations provided by the attention mechanism do not guarantee consistency with the actual transformations performed by the model (refs. 30,31). In contrast, ReactSeq enables LMs to accurately track the changes of atoms and bonds throughout the reaction process without any modifications to the model architecture, offering a more streamlined and reliable solution for interpretable retrosynthesis prediction.

Figure 3a presents the process of generating ReactSeq using beam search by our retrosynthesis model, where the total probability is the product of the probabilities predicted at each step. Notably, the total prediction probability is mainly influenced by MEO tokens, which represent the dynamic transformation of decomposing product molecules and completing synthons. The remaining tokens, used to describe the static molecular structure, are predicted with consistently stable probabilities. In contrast, the total prediction probability of the SMILES-based model is influenced by many tokens, suggesting a more intricate decision-making process (Supplementary Fig. 3). Furthermore, the prediction probabilities for the header and tail tokens in ReactSeq allow calculation of the model's confidence in its predictions at each stage. There is a clear trend where the accuracy of predictions improves as the model's confidence increases (Fig. 3b).

Fig. 3 | Interpretable retrosynthesis prediction with ReactSeq. a, Presentation of the inference processes of our method. P1 and P2 refer to the predicted probabilities for reaction centre identification and synthon completion, respectively, while Pt refers to the total probability for final prediction result. b, The relationship between model's predictive confidence measured by predictive probability and its accuracy across different tasks.
Panel a plots log(total probability) against decoding steps for the rank-1 to rank-5 beams, divided into reaction centre identification and synthon completion. Values legible in the extract: P1 = 0.977 and 0.022; P2 = 0.637 (Cl), 0.358 (OH), 0.485 (Cl), 0.394 (Br) and 0.003 (O); Pt = 0.622 (rank 1), 0.350 (rank 2), 0.011 (rank 3), 0.009 (rank 4) and 0.002 (rank 5). Panel b shows counts of correct and incorrect predictions against confidence for retrosynthesis prediction, reaction centre identification and synthon completion.
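The stage-wise confidences can be illustrated with a short sketch. Since the total probability is the product of the per-token probabilities, splitting the token sequence where the tail begins gives one probability for the header (P1), one for the tail (P2), and their product as the total (Pt). The function name, the toy tokens and the toy per-token probabilities below are invented for illustration, and the split at the first '<' token is an assumption about how the sequence is tokenized. The final loop checks the values legible in Fig. 3a; the pairing of P1 and P2 values with ranks 1 to 4 is inferred from the extracted figure text.

```python
import math


def stage_confidences(tokens, token_probs):
    """Split per-token probabilities at the first '<' (start of the tail)
    and return (P1, P2, Pt) for header, tail and whole sequence."""
    split = tokens.index("<") if "<" in tokens else len(tokens)
    log_p1 = sum(math.log(p) for p in token_probs[:split])
    log_p2 = sum(math.log(p) for p in token_probs[split:])
    return math.exp(log_p1), math.exp(log_p2), math.exp(log_p1 + log_p2)


# Toy sequence: tokens and probabilities are invented for illustration.
tokens = ["C", "!", "N", "<", "[Cl:1]", ">", "<", ">"]
probs = [0.99, 0.80, 0.98, 0.97, 0.70, 0.99, 0.95, 0.99]
p1, p2, pt = stage_confidences(tokens, probs)
print(f"P1={p1:.3f} P2={p2:.3f} Pt={pt:.3f}")

# Consistency check on the values legible in Fig. 3a (ranks 1 to 4):
# each reported Pt should equal P1 * P2 up to rounding.
figure_values = [(0.977, 0.637, 0.622), (0.977, 0.358, 0.350),
                 (0.022, 0.485, 0.011), (0.022, 0.394, 0.009)]
for fig_p1, fig_p2, fig_pt in figure_values:
    assert abs(fig_p1 * fig_p2 - fig_pt) < 5e-4
```

Summing log-probabilities instead of multiplying raw probabilities is a common choice for long sequences because it avoids numerical underflow; it also matches the log(total probability) axis used in Fig. 3a.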

ReactSeq enables prompt-based reaction prediction

In retrosynthesis prediction, human experts often have insights regarding the location and type of reaction that should occur. We believe that incorporating this expert knowledge through prompts can guide LMs to generate more accurate predictions. A key challenge of this process is linguistically encoding the diverse human prompts. Thakkar et al. achieved this by tagging atoms in SMILES (ref. 17). However, their prompts were restricted to indicating bond-breaking positions, failing to encode other types of human prompt. Conversely, ReactSeq defines tokens representing various MEOs, enabling a more comprehensive and refined encoding of human prompts.

To demonstrate this, we trained a prompt learning model using ReactSeq. As shown in Fig. 4a, the model is capable of processing various types of molecular editing prompt encoded by ReactSeq, and is guided to perform specific reaction transformations. We further tested this prompt-based learning strategy on the USPTO-50k dataset, achieving 96.6% top-one accuracy in identifying reaction centres and 74.9% top-one accuracy in predicting final reactants, significantly outperforming the model without prompts (Fig. 4b). It is crucial to point out these results were obtained using fully accurate human prompts, which

Fig. 4 (caption not included in this extract). a, Three prompts posed on the same product and the prediction obtained with each. Q1: break this bond? Prompt 1: N#CC1=CC=C(C(C(O)CC2=CC=C(F)C=C2)!N2C=NC=N2)C=C1. Q2: change this bond to double bond? Prompt 2: N#CC1=CC=C(C(C(;O)CC2=CC=C(F)C=C2)N2C=NC=N2)C=C1. Q3: change this bond to single bond? Prompt 3: N_CC1=CC=C(C(C(O)CC2=CC=C(F)C=C2)N2C=NC=N2)C=C1. b, Bar chart of top-1, top-3, top-5 and top-10 accuracy (%) for synthon prediction and reactant prediction. c, A 'Break' prompt whose outputs are labelled Suzuki coupling and Grignard reaction.
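The three prompts of Fig. 4a can be reproduced from the product SMILES with plain string edits, which shows how a chemist's instruction maps onto MEO tokens. The sketch below is illustrative only: the helper names `add_edit` and `find_prompt_edits` are hypothetical, positions are located by simple substring search for this one molecule, and nothing here reflects how the trained model consumes the prompt. A hidden single bond has no token of its own, so its MEO token is inserted, whereas a visible bond token such as '#' is replaced.

```python
MEO_MEANINGS = {
    "!": "break bond",
    ";": "change to double bond",
    "_": "change to single bond",
}


def add_edit(smiles: str, index: int, token: str, replace: bool = False) -> str:
    """Insert an MEO token at index, or replace the bond token found there."""
    end = index + 1 if replace else index
    return smiles[:index] + token + smiles[end:]


def find_prompt_edits(prompt: str) -> list[tuple[int, str, str]]:
    """List (position, token, meaning) for each bond-level MEO token."""
    return [(i, ch, MEO_MEANINGS[ch]) for i, ch in enumerate(prompt)
            if ch in MEO_MEANINGS]


product = "N#CC1=CC=C(C(C(O)CC2=CC=C(F)C=C2)N2C=NC=N2)C=C1"

# Q1: break the hidden single bond in front of the triazole nitrogen.
prompt1 = add_edit(product, product.index("N2C"), "!")
# Q2: change the hidden single bond in front of the oxygen to a double bond.
prompt2 = add_edit(product, product.index("O"), ";")
# Q3: change the visible triple bond of the nitrile to a single bond.
prompt3 = add_edit(product, product.index("#"), "_", replace=True)

for prompt in (prompt1, prompt2, prompt3):
    print(prompt, find_prompt_edits(prompt))
```

The three strings produced this way match Prompts 1 to 3 as printed in Fig. 4a, and `find_prompt_edits` recovers which operation each prompt requests.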
