
Experimental & Molecular Medicine

/10.1038/s12276-025-01583-1

Article in Press

A survey on large language models in biology and chemistry

Received: 7 July 2025

Accepted: 27 August 2025

Published online: 15 November 2025

Cite this article as: Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin et al. A survey on large language models in biology and chemistry. Exp Mol Med. (2025).

https://

/10.1038/s12276-025-01583-1

Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin, Ikhyeong Jun, Umit V. Ucak, Jay-Yoon Lee & Juyong Lee

We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.

If this paper is publishing under a Transparent Peer Review model then Peer Review reports will publish with the final article.

© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit

http://

/licenses/by/4.0/

A Survey on Large Language Models in Biology and Chemistry

Islambek Ashyrmamatov¹†, Su Ji Gwak²†, Su-Young Jin², Ikhyeong Jun³, Umit V. Ucak¹*, Jay-Yoon Lee²* and Juyong Lee¹,³*

¹Research Institute of Pharmaceutical Science, College of Pharmacy, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea

²Graduate School of Data Science, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea

³Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea

†These authors have contributed equally to this work.

Corresponding authors: braket@snu.ac.kr, lee.jayyoon@snu.ac.kr, nicole23@snu.ac.kr

Abstract

Artificial intelligence (AI) is reshaping biomedical research by providing scalable computational frameworks suited to the complexity of biological systems. Central to this revolution are bio/chemical language models (LMs), including Large Language Models (LLMs), which are re-conceptualizing molecular structures as a form of "language" amenable to advanced computational techniques. This review critically examines the role of these models in biology and chemistry, tracing their evolution from molecular representation to molecular generation and optimization. This review covers key molecular representation strategies for both biological macromolecules and small organic compounds, ranging from protein and nucleotide sequences to single-cell data, string-based chemical formats, graph-based encodings, and 3D point clouds, highlighting their respective advantages and inherent limitations in AI applications. The discussion further explores core model architectures, such as BERT-like encoders, GPT-like decoders, and encoder-decoder transformers, alongside their sophisticated pre-training strategies like self-supervised learning, multi-task learning, and retrieval-augmented generation. Key biomedical applications, spanning protein structure and function prediction, de novo protein design, genomic analysis, molecular property prediction, de novo molecular design, reaction prediction, and retrosynthesis, are explored through representative studies and emerging trends. Finally, the review considers the emerging landscape of agentic and interactive AI systems, showcasing briefly their potential to automate and accelerate scientific discovery while addressing critical technical, ethical, and regulatory considerations that will shape the future trajectory of AI in biomedicine.

1. Introduction

Large language models (LLMs), built on deep neural architectures and trained on massive text corpora, have achieved state-of-the-art performance in language understanding, generation, and reasoning. Although originally developed for natural language, their core modeling principles are broadly transferable to symbolic scientific data. This has spurred growing interest in adapting LLMs to scientific domains, particularly in chemistry and biology.¹,²

Scientific knowledge and understanding critically depend on the construction of formal representations that encode the structure and behavior of physical and biological systems. These representations are designed for fidelity in capturing domain-specific properties, but rarely align with the distributional and syntactic patterns of language models. Thus, various approaches have been proposed to better align LLMs with scientific representations.³,⁴

What enables LLMs to perform so effectively is not an understanding of individual tokens, but their ability to model the statistical structure that governs token composition. In scientific domains, a model’s ability to infer properties depends on how well the input representation encodes underlying structure. Thus, representational design is not peripheral but fundamental for developing scientific LLMs. It determines what models can learn, generalize, and ultimately, discover. In addition, it is well known that the scales of model architecture and training data are critical to the accuracy and emergent behaviors of LLMs.⁵ Thus, the success of scientific LLMs rests on both the scale and architecture of the models, and on how effectively the representation translates a domain structure into a learnable entity.

Recent progress in using LLMs in biology and chemistry has been accelerated by the growth of curated, domain-specific datasets. Molecular and protein databases, along with scientific literature, now support diverse training strategies, from self-supervised objectives to multimodal integration. However, much of this development remains fragmented, and systematic comparisons across chemical and biological domains are still limited.

In this review, we examine how LLMs are being adapted to the unique demands of chemical and biological topics. We focus on how representations, architectures, and training regimes influence model performance across domains and tasks. The foundational challenge lies in converting complex, multi-dimensional molecular information into formats that language models can process (Fig. 1). Our goal is to clarify what has been achieved, what remains challenging, and how these models will better serve scientific understanding.

2. Biological language models

The unprecedented success of large language models (LLMs) has opened a new paradigm in data analysis. In the field of biology, the utilization of various biological data such as protein sequences,⁶ structures,⁷ nucleotides,⁸ and species taxonomy⁹ has been considered. The application of Transformer architectures to biological problems has led to significant breakthroughs, with AlphaFold2 (AF2)¹⁰ and RoseTTAFold (RF)¹¹ emerging as landmark models in protein structure prediction. In parallel, ongoing research is being conducted to describe biological complexity more accurately within the models (see Table 1).

2.1. Protein language models

The sequential nature of proteins has enabled the application of language modeling techniques from natural language processing. Early models such as ProtBERT,¹² MSA Transformer,¹³ and ProtTrans¹⁴ leveraged core techniques from deep language models while exploring variations in both input formats (e.g., single sequences and multiple sequence alignments (MSAs)) and architectures (e.g., unidirectional and BERT-style bidirectional encoders). ESMFold² achieves AlphaFold2-level accuracy in protein structure prediction without relying on MSAs, capturing contextual dependencies solely through language modeling. The scaling of model parameters and faster structure prediction highlight the potential of language models when trained on large-scale biological data. ProtMamba¹⁵ also showed that protein language modeling is feasible without MSAs. The model adopts a Mamba¹⁶-based state space architecture instead of an attention-based one to handle long sequences.

Protein design aims to generate proteins with completely new functions and structures, and generative models can play a key role in the process. ProGen¹⁷ enables controlled protein sequence generation by incorporating conditioning tags into an autoregressive transformer architecture. ProGen2¹⁸ and ProtGPT2¹⁹ further improve upon previous models by leveraging more complex conditioning tags to generate sequences that satisfy both structural and functional constraints. Recently, diffusion architectures, developed for image generation from text prompts, have been adapted for protein structure generation. RFdiffusion²⁰ incorporates spatial constraints through SE(3) equivariance, enabling more efficient and physically consistent sampling of protein structures. Such structural modeling has facilitated scaffolding tasks, and tools including ProteinMPNN²¹ and Foldseek²² have accelerated advances in protein design.

2.2. Protein structure models

Protein structure models predict the tertiary structures of proteins from their primary amino acid sequences. Traditionally, techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) have been employed to elucidate protein structures. However, these experimental methods are often constrained by high costs, time requirements, and technical limitations, resulting in a considerably slower accumulation of structural data compared to the rapidly expanding number of known protein sequences.²³ This sequence-structure data imbalance (e.g., between UniProtKB²⁴ and the PDB⁷) underscores the need for computational prediction approaches to complement experimental efforts.

AlphaFold (AF)²⁵ and AlphaFold2 (AF2)¹⁰ have demonstrated outstanding performance in the field of protein structure prediction, as evidenced by their success in Critical Assessment of protein Structure Prediction 13 (CASP13) and CASP14, respectively. AF2 consists of two primary modules: the Evoformer and the structure module. Unlike AF, which employs a ResNet-based convolutional neural network (CNN), AF2 introduces an attention-based Evoformer, enabling efficient processing of MSAs and pairwise residue interactions. The Evoformer can be interpreted as a biology-specific transformer, where MSAs are treated as sequences in natural language, capturing evolutionary patterns across homologous proteins. This approach has been more fully realized in protein language models (pLMs), which are designed to replace MSAs by implicitly modeling evolutionary information. The structure module allows for end-to-end learning from primary sequence to 3D structural reconstruction, achieving near experimental accuracy.

Several platforms have been developed to extend the applicability and accessibility of protein structure models. ColabFold²⁶ leverages a metagenomic sequence database (ColabFoldDB) to enhance the diversity and quality of MSAs, and it is implemented to run on web-based GPU resources through Google Colaboratory. This approach improves accessibility to high-accuracy protein structure prediction while effectively reducing computational resource burdens. Phyre2.2²⁷ is an upgraded platform for protein structure and function prediction that maintains a user-friendly interface while integrating AlphaFold-predicted structures as new templates. It enables large-scale structural analysis by utilizing a broader range of structural templates beyond those available in the PDB. Furthermore, it supports domain-level optimization and batch-mode prediction, thereby serving as a computational alternative that complements experimental studies.

2.3. Nucleotide language models

Unlike natural language, DNA does not possess an inherent concept of "words," and its composition is limited to just four nucleotides: adenine (A), thymine (T), guanine (G), and cytosine (C). Protein sequences, by contrast, are composed of approximately 20 amino acids. This limited alphabet reduces the overall information density, making the development of effective DNA language models more challenging.
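The difference in alphabet size can be put in rough numbers. The sketch below computes the maximum information a single symbol can carry, assuming every symbol is equally likely; that uniformity is a simplifying assumption (real sequences are not uniform), and the function name is introduced here only for illustration.

```python
import math


def max_bits_per_symbol(alphabet_size: int) -> float:
    """Upper bound on information per symbol, assuming a uniform alphabet."""
    return math.log2(alphabet_size)


# Four nucleotides (A, C, G, T) versus roughly 20 amino acids.
dna_bits = max_bits_per_symbol(4)
protein_bits = max_bits_per_symbol(20)

print(f"DNA: {dna_bits:.2f} bits/symbol, protein: {protein_bits:.2f} bits/symbol")
```

Under this bound a nucleotide carries at most 2 bits and an amino acid about 4.32 bits, so a DNA model needs a longer context to cover the same amount of information.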

Earlier approaches, such as DeepSite,²⁸ utilized CNNs and recurrent neural networks (RNNs) for modeling DNA sequences. However, CNNs often struggle with capturing long-range dependencies, and RNNs suffer from computational inefficiency and scalability issues. To address these limitations, DNABERT²⁹ adopted masked language modeling (MLM) based on bidirectional encoder representations from transformers (BERT) using k-mer tokenization (a.k.a. n-gram in computer science), enabling more effective sequence representation. Subsequent models, including GROVER³⁰ and DNABERT2,³¹ leveraged Byte Pair Encoding (BPE),³² a tokenization scheme employed by the SentencePiece³³ framework, to flexibly define token units. This helped reduce sequence information loss and improved computational efficiency. As a result, transformer-based models have been successfully applied to tasks such as identifying promoters and transcription factor binding sites (TFBSs) directly from DNA sequences. Caduceus³⁴ employs character-level (base-pair) tokenization, which ensures robustness to minor sequence variations. Furthermore, by modeling DNA sequences bidirectionally and incorporating reverse complement (RC) equivariance, Caduceus demonstrates superior performance on tasks such as regulatory site prediction and long-range SNP effect inference. Recently, research has moved beyond masked language modeling toward generative approaches, such as MegaDNA,³⁵ a transformer-based DNA sequence generation model.
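To make two of the ideas in this paragraph concrete, the following minimal sketch shows overlapping k-mer tokenization and the reverse-complement operation that RC-equivariant models are built around. The function names and default values are illustrative assumptions, not the actual tokenizers or code of DNABERT or Caduceus.

```python
def kmer_tokenize(seq: str, k: int = 6, stride: int = 1) -> list[str]:
    """Split a DNA string into k-mers; stride=1 gives overlapping tokens."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


# Translation table mapping each base to its Watson-Crick partner.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the sequence of the opposite strand, read 5' to 3'."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


overlapping = kmer_tokenize("ATGCGT", k=3)               # stride 1
non_overlapping = kmer_tokenize("ATGCGT", k=3, stride=3)
rc = reverse_complement("ATGC")
```

With stride 1, a single-base change alters up to k neighboring tokens, whereas character-level tokenization changes exactly one token. That is one way to read the robustness argument made for base-pair tokenization above.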

GenSLM³⁶ is an RNA language model capable of mutation effect prediction by capturing the differences between original and mutated RNA sequences and predicting their functional effects. The model uses a codon-level vocabulary, which avoids frameshift issues, for tokenizing RNA sequences. The study addresses input lengths that exceed the maximum capacity of the standard transformer. This limitation has been identified as a fundamental architectural bottleneck in early foundation models designed for nucleotide sequence analysis. Evo,³⁷ HyenaDNA,³⁸ and Caduceus³⁴ have adopted specialized architectures, such as Hyena³⁹ and Mamba, to support long-sequence modeling.
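Codon-level tokenization can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: the function name, the `frame` parameter, and the choice to drop an incomplete trailing codon are introduced here and are not taken from GenSLM.

```python
def codon_tokenize(seq: str, frame: int = 0) -> list[str]:
    """Split a coding sequence into non-overlapping triplets from `frame`.

    Any incomplete trailing codon is dropped (an assumption of this sketch).
    """
    seq = seq.upper()[frame:]
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


codons = codon_tokenize("ATGGCCTAA")
shifted = codon_tokenize("ATGGCCTAA", frame=1)
```

Because every token is one codon read in a fixed frame, the token boundaries coincide with the unit that is translated into an amino acid, which is the property the text refers to as avoiding frameshift issues.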

2.4. Single-cell language models

With the accumulation of high-dimensional gene expression data, single-cell language models have emerged as a new frontier in biology. While proteins and nucleotides are naturally sequential, single-cell gene expression data are not universally sequential. Therefore, a method of ranking genes based on their expression levels has been proposed. Genes within a cell are treated as words in a sentence, and Transformer-based models are applied to capture their underlying dependencies, as in other biological language modeling tasks.

Recent advances in single-cell representation learning have surpassed traditional marker gene-based approaches in capturing cellular heterogeneity.⁴⁰ scBERT⁴¹ addresses this limitation by leveraging full gene expression profiles, achieving strong performance in cell type annotation. Geneformer⁴² handles the non-sequential nature of gene expression data by ordering genes based on count statistics, also showing effectiveness in classification tasks. Building on this, scGPT⁴³ takes gene embeddings as input tokens and outputs a cell embedding, jointly learning representations at both levels. It achieves state-of-the-art results across tasks such as cell type classification, perturbation prediction, batch correction, and multi-omics integration. These findings emphasize the value of large-scale single-cell datasets (e.g., the Human Cell Atlas,⁴⁴ CellMarker⁴⁵) and the potential of embedding models to capture cellular complexity.
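The idea of turning an unordered expression profile into a token sequence by ranking can be illustrated as follows. This is a deliberately simplified sketch: it ranks genes by the values given and omits any corpus-level normalization a real model may apply. The function name, the tie-breaking rule, and the example values are assumptions made for illustration.

```python
from typing import Optional


def rank_encode(expression: dict[str, float],
                max_len: Optional[int] = None) -> list[str]:
    """Order expressed genes from highest to lowest value.

    Genes with zero expression are dropped; ties are broken by gene name
    so the output is deterministic (a choice made for this sketch).
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda item: (-item[1], item[0]))
    tokens = [gene for gene, _ in expressed]
    return tokens if max_len is None else tokens[:max_len]


# Hypothetical expression values for one cell.
cell = {"GAPDH": 50.0, "CD3E": 12.0, "MS4A1": 0.0, "ACTB": 80.0}
sentence = rank_encode(cell)
```

The resulting list plays the role of a "sentence" whose "words" are gene identifiers, which a Transformer can then consume like any other token sequence.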

At the same time, approaches have been proposed to leverage general-purpose LLMs for directly incorporating prior biological knowledge, going beyond gene sequence modeling alone. For example, despite being trained on common human languages, GPT-4 has shown the ability to perform automatic cell type annotation based on text prompts describing gene expression levels.⁴⁶ Accordingly, GenePT⁴⁷ and scELMo⁴⁸ have constructed gene- and cell-level embeddings by applying text embedding APIs to a corpus of biomedical literature, including the NCBI database. These embeddings have been reported to outperform some biological data-driven models such as Geneformer.⁴² In addition, CancerGPT,⁴⁹ a GPT-3⁵⁰ model fine-tuned on corpora of text, predicts drug response pairs within rare tissue types by aligning textual representations with cellular information. Developing disease-specific models with refined cell embeddings may further advance precision medicine.

2.5. Biomolecule representations

Biological macromolecules such as proteins and nucleic acids can be represented through diverse modalities to support machine learning applications. Sequence-based representations use amino acid or nucleotide strings and serve as the foundation for protein and genomic language models such as ESM,² ProtBERT,¹² and DNABERT.²⁹,³¹ Structural representations capture spatial information using atomic coordinates, contact maps, or distance matrices, which are leveraged in structure models like AF and ESMFold. Graph-based approaches abstract biomolecules into nodes and edges, enabling the use of geometric deep learning models such as SE(3) Transformer.⁵¹ Functional representations include Gene Ontology terms, protein family annotations, and subcellular localization, enriching models with biological context. At the cellular level, omics data such as scRNA-seq are encoded as high-dimensional expression vectors.
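The relationship between the structural representations mentioned above can be shown with a small sketch that derives a distance matrix and a binary contact map from 3D coordinates. The 8 Å cutoff is a commonly used convention and is an assumption here, as are the function names and the toy coordinates; real pipelines typically work on one representative atom per residue.

```python
import math

Coord = tuple[float, float, float]


def distance_matrix(coords: list[Coord]) -> list[list[float]]:
    """Pairwise Euclidean distances between all points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(coords: list[Coord], cutoff: float = 8.0) -> list[list[bool]]:
    """Mark pairs closer than `cutoff` (in the same unit as the coordinates)."""
    return [[d <= cutoff for d in row] for row in distance_matrix(coords)]


# Three hypothetical residue positions, in angstroms.
residues = [(0.0, 0.0, 0.0), (3.0, 4.0, 0.0), (0.0, 0.0, 10.0)]
distances = distance_matrix(residues)
contacts = contact_map(residues)
```

Both outputs are invariant to rotating or translating the whole structure, which is one reason distance-based representations are convenient inputs for learning.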

2.6. Tokenization strategies

Tokenization methods have evolved from traditional machine learning techniques, including k-mer approaches,⁵² to biomolecule-specialized strategies such as structure- and codon-based tokenization,⁵³ which are critical for accurate and detailed biomolecular modeling. In protein and nucleotide models, k-mer tokenization (e.g., 3-mer, 6-mer) is used to capture local biochemical context, as seen in DNABERT and ProtBERT. Some models use byte-pair encoding (BPE) or unigram models trained on large corpora of sequences, such as DNABERT2, ESM, and ProGen. Codon-based or codon-preserving tokenization is also adopted to avoid frame-shift artifacts in nucleotide modeling. scBERT employs the gene2vec approach to generate gene embeddings, which facilitates the application of the BERT architecture to single-cell RNA sequencing data. These customized strategies ensure efficient representation of biological syntax and semantics in pretrained language models.

2.7. Application of BLMs in Biomedicine

2.7.1. Integrative modeling for molecular cell biology

AF2 demonstrated the strength of AI in protein structure prediction and has since inspired a wide range of follow-up studies. Models such as AlphaFold3,⁵⁴ RoseTTAFoldNA,⁵⁵ and RoseTTAFold All-Atom⁵⁶ extend their focus beyond proteins to include other biologically relevant molecules such as RNA, DNA, and ligands. In particular, all-atom structure prediction introduces computational challenges in accurately reconstructing 3D coordinates. This reflects a growing recognition that structural accuracy is essential for understanding biomolecular function not only in proteins, but also in RNA, where structure plays a critical role in regulatory activity.⁵⁷ Concurrently, large language model (LLM)-based methods have begun to incorporate structural information, moving beyond sequence modeling. ESM3⁵⁸ jointly embeds sequence, structure, and function, marking a transition toward multimodal representation. Specialized models such as ESM-DBP⁵⁹ have also been developed to predict DNA-binding proteins, adopting hybrid approaches that leverage both sequence and structure features. In the context of unified modeling in biological language models, foundation models aim to learn comprehensive cellular representations by integrating diverse biological modalities. These include epigenetic marks, spatial transcriptomics, protein expression data, and perturbation signatures, which can be explored to gain a deeper understanding of cellular function.⁶⁰ The integration signals a broader shift from modality-specific models toward unified representations that more reasonably reflect the inherent complexity of biological systems.

2.7.2. Multimodal foundation models

Multimodal Large Language Models (MLLMs) offer a framework for aligning heterogeneous data types such as clinical notes, protein sequences, and molecular structures.

BiomedGPT⁶¹ aligns natural language with biomedical modalities, particularly visual representations, to enable cross-modal reasoning for visual-language tasks. It focuses on applications such as diagnosis, summarization, and clinical decision support through flexible query answering. However, such models still exhibit limitations in reasoning across complex clinical scenarios, including the interpretation of radiological images and the resolution of textual conflicts. MediConfusion⁶² provides a diagnostic benchmark that systematically evaluates failure modes of multimodal medical LLMs.

Tx-LLM⁶³ leverages the advantages of large-scale pretraining on diverse biological datasets. Specifically, it is trained on sequence-level information encompassing RNA, DNA, and protein sequences, as well as SMILES. This comprehensive approach enables positive transfer performance in end-to-end drug discovery tasks, outperforming models that do not incorporate biological sequence data. Similarly, BioMedGPT-10B⁶⁴ contributes to drug discovery by specializing in protein and molecule Question and Answering (QA), having been trained on cell sequences and protein and molecule structures. These advancements highlight the potential of LLMs to serve as unified multimodal platforms in biomedicine (Fig. 2).

3. Chemical language models

Chemical Language Models (CLMs) have been proposed to learn the structure-activity relationship of small molecules from large-scale chemical data using various sequential representations of molecules, e.g., Simplified Molecular Input Line Entry System (SMILES).⁶⁵

3.1. Model Types

Similar to pLMs, most CLMs leverage Transformer architectures,⁶⁶ akin to those in natural language processing, to understand, generate, and manipulate chemical structures and reactions. These models are broadly categorized based on their architectural design, each optimized for distinct tasks within cheminformatics and drug discovery. The primary model types include encoder-only (BERT-like) models, decoder-only (GPT-like) models, and encoder-decoder architectures, as well as emerging multi-modal LLMs that integrate diverse data formats (Fig. 3). These architectural choices dictate how the models process molecular representations and perform tasks ranging from property prediction to de novo molecular design and retrosynthesis.

3.1.1. Chemical encoders

Encoder-only transformer models, primarily inspired by BERT, are designed to extract contextual representations of molecules and are well-suited for property prediction and molecular understanding. ChemBERTa⁶⁷ adapts the RoBERTa⁶⁸ framework with MLM and multitask regression, where auxiliary property prediction tasks are defined using molecular features computed by RDKit.⁶⁹ Mol-BERT⁷⁰ applies MLM to learn chemically informed token-level dependencies and is fine-tuned for tasks such as property classification and activity prediction. MoLFormer⁷¹ extends this approach using linear attention and rotary embeddings, yielding compact representations useful for downstream regression and classification tasks, though it is limited to relatively small molecules. Further encoder variants refine token representations or integrate structural priors. MolRoPE-BERT⁷² enhances positional encoding, while MFBERT,⁷³ SELFormer,⁷⁴ and semi-RoBERTa⁷⁵ introduce architectural modifications for greater chemical expressiveness. Graph-enhanced encoders like GROVER⁷⁶ incorporate topological features directly, bridging the gap between sequence and graph representations.
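The MLM objective used by these encoders can be illustrated on a SMILES string. The sketch below tokenizes a SMILES with a simple regular expression and then replaces a random subset of tokens with a mask symbol, recording the originals as prediction targets. The regular expression, the 15% default rate, the `[MASK]` symbol, and the function names are illustrative assumptions. This is not the tokenizer or masking scheme of any specific model above, and it omits refinements such as BERT's random-replacement and keep-as-is cases.

```python
import random
import re
from typing import Optional

# Bracket atoms first, then two-letter halogens, ring closures, single symbols.
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-+\\/().:~@*$]")


def tokenize_smiles(smiles: str) -> list[str]:
    """Split a SMILES string into atom, bond, branch, and ring tokens."""
    return _SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens: list[str], mask_prob: float = 0.15, seed: int = 0,
                mask_token: str = "[MASK]") -> tuple[list[str], list[Optional[str]]]:
    """Return (masked input, labels); labels are None where nothing is masked."""
    rng = random.Random(seed)
    masked: list[str] = []
    labels: list[Optional[str]] = []
    for token in tokens:
        if rng.random() < mask_prob:
            masked.append(mask_token)
            labels.append(token)   # the model must recover this token
        else:
            masked.append(token)
            labels.append(None)    # position ignored by the loss
    return masked, labels


aspirin = "CC(=O)Oc1ccccc1C(=O)O"
tokens = tokenize_smiles(aspirin)
masked, labels = mask_tokens(tokens)
```

During pre-training, the encoder sees `masked` and is scored only at positions where `labels` is not `None`, so it has to infer hidden atoms and bonds from the surrounding chemical context.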

3.1.2. Chemical decoders

Decoder-only transformer models, following the GPT architecture, are optimized for autoregressive generation and have become essential in de novo molecular design. MolGPT⁷⁷ prioritizes causality to learn token-wise dependencies and ultimately generates novel molecules. It supports conditional generation strategies to bias outputs toward specific chemical properties. GP-MoLFormer⁷⁸ is a decoder-only adaptation of MoLFormer-XL⁷¹ and is optimized for tasks such as unconstrained molecule generation, scaffold completion, and conditional property optimization. Other GPT-based chemical models include SMILES-GPT⁷⁹ and iupacGPT,⁸⁰ both adapted from GPT-2⁸¹ for molecular and nomenclature sequence generation. cMolGPT⁸² extends this framework for controllable generation under property or scaffold constraints. Taiga⁸³ combines GPT modeling with reinforcement learning to guide molecule synthesis toward multi-objective goals.
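Autoregressive generation, the mechanism shared by these decoders, amounts to sampling one token at a time conditioned on everything generated so far. The sketch below shows only that sampling loop. The `toy_model` function is a hypothetical stand-in for a trained network, and all names and the special tokens are assumptions made for illustration.

```python
import random
from typing import Callable

BOS, EOS = "<bos>", "<eos>"


def generate(next_token_probs: Callable[[list[str]], dict[str, float]],
             max_len: int = 64, seed: int = 0) -> list[str]:
    """Sample tokens one by one until EOS or max_len is reached."""
    rng = random.Random(seed)
    tokens = [BOS]
    for _ in range(max_len):
        probs = next_token_probs(tokens)          # conditioned on the full prefix
        candidates = list(probs)
        weights = [probs[t] for t in candidates]
        token = rng.choices(candidates, weights=weights, k=1)[0]
        if token == EOS:
            break
        tokens.append(token)
    return tokens[1:]                              # drop the BOS marker


def toy_model(prefix: list[str]) -> dict[str, float]:
    """Hypothetical stand-in for a trained decoder: three carbons, then stop."""
    generated = len(prefix) - 1
    return {"C": 1.0} if generated < 3 else {EOS: 1.0}


smiles = "".join(generate(toy_model))
```

In this framing, conditional generation corresponds to placing property or scaffold tokens in the prefix before sampling starts, so that every later step is conditioned on them.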

3.1
