Experimental & Molecular Medicine
/10.1038/s12276-025-01583-1
Article in Press
A survey on large language models in biology and chemistry
Received: 7 July 2025
Accepted: 27 August 2025
Published online: 15 November 2025
Cite this article as: Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin et al. A survey on large language models in biology and chemistry. Exp Mol Med. (2025).
https://
/10.1038/s12276-025-01583-1
Islambek Ashyrmamatov, Su Ji Gwak, Su-Young Jin, Ikhyeong Jun, Umit V. Ucak, Jay-Yoon Lee & Juyong Lee
We are providing an unedited version of this manuscript to give early access to its findings. Before final publication, the manuscript will undergo further editing. Please note there may be errors present which affect the content, and all legal disclaimers apply.
If this paper is publishing under a Transparent Peer Review model then Peer Review reports will publish with the final article.
© The Author(s) 2025. Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit
http://
/licenses/by/4.0/
.
A Survey on Large Language Models in Biology and Chemistry
Islambek Ashyrmamatov¹†, Su Ji Gwak²†, Su-Young Jin², Ikhyeong Jun³, Umit V. Ucak¹*, Jay-Yoon Lee²* and Juyong Lee¹,³*
¹Research Institute of Pharmaceutical Science, College of Pharmacy, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
²Graduate School of Data Science, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
³Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, 1 Gwanak-ro, Gwanak-gu, Seoul 08826, Republic of Korea
†These authors have contributed equally to this work.
*Corresponding authors: braket@snu.ac.kr, lee.jayyoon@snu.ac.kr, nicole23@snu.ac.kr
Abstract
Artificial intelligence (AI) is reshaping biomedical research by providing scalable computational frameworks suited to the complexity of biological systems. Central to this revolution are bio/chemical language models (LMs), including Large Language Models (LLMs), which are re-conceptualizing molecular structures as a form of "language" amenable to advanced computational techniques. This review critically examines the role of these models in biology and chemistry, tracing their evolution from molecular representation to molecular generation and optimization. This review covers key molecular representation strategies for both biological macromolecules and small organic compounds (ranging from protein and nucleotide sequences to single-cell data, string-based chemical formats, graph-based encodings, and 3D point clouds), highlighting their respective advantages and inherent limitations in AI applications. The discussion further explores core model architectures, such as BERT-like encoders, GPT-like decoders, and encoder-decoder transformers, alongside their sophisticated pre-training strategies like self-supervised learning, multi-task learning, and retrieval-augmented generation. Key biomedical applications, spanning protein structure and function prediction, de novo protein design, genomic analysis, molecular property prediction, de novo molecular design, reaction prediction, and retrosynthesis, are explored through representative studies and emerging trends. Finally, the review considers the emerging landscape of agentic and interactive AI systems, showcasing briefly their potential to automate and accelerate scientific discovery while addressing critical technical, ethical, and regulatory considerations that will shape the future trajectory of AI in biomedicine.
1. Introduction
Large language models (LLMs), built on deep neural architectures and trained on massive text corpora, have achieved state-of-the-art performance in language understanding, generation, and reasoning. Although originally developed for natural language, their core modeling principles are broadly transferable to symbolic scientific data. This has spurred growing interest in adapting LLMs to scientific domains, particularly in chemistry and biology.¹,²
Scientific knowledge and understanding critically depend on the construction of formal representations that encode the structure and behavior of physical and biological systems. These representations are designed for fidelity in capturing domain-specific properties, but rarely align with the distributional and syntactic patterns of language models. Thus, various approaches have been proposed to better align LLMs with scientific representations.³,⁴
What enables LLMs to perform so effectively is not an understanding of individual tokens, but their ability to model the statistical structure that governs token composition. In scientific domains, a model's ability to infer properties depends on how well the input representation encodes underlying structure. Thus, representational design is not peripheral but fundamental for developing scientific LLMs. It determines what models can learn, generalize, and ultimately, discover. In addition, it is well known that the scales of the model architecture and the training data are critical to the accuracy and emergent behaviors of LLMs.⁵ Thus, the success of scientific LLMs rests both on the scale and architecture of the models and on how effectively the representation translates a domain structure into a learnable entity.
Recent progress in using LLMs in biology and chemistry has been accelerated by the growth of curated, domain-specific datasets. Molecular and protein databases, along with scientific literature, now support diverse training strategies, from self-supervised objectives to multimodal integration. However, much of this development remains fragmented, and systematic comparisons across chemical and biological domains are still limited.
In this review, we examine how LLMs are being adapted to the unique demands of chemical and biological topics. We focus on how representations, architectures, and training regimes influence model performance across domains and tasks. The foundational challenge lies in converting complex, multi-dimensional molecular information into formats that language models can process (Fig. 1). Our goal is to clarify what has been achieved, what remains challenging, and how these models will better serve scientific understanding.
2. Biological language models
The unprecedented success of LLMs has opened a new paradigm in data analysis. In biology, the use of diverse data types, such as protein sequences,⁶ structures,⁷ nucleotides,⁸ and species taxonomy,⁹ has been explored. The application of Transformer architectures to biological problems has led to significant breakthroughs, with AlphaFold2 (AF2)¹⁰ and RoseTTAFold (RF)¹¹ emerging as landmark models in protein structure prediction. In parallel, ongoing research aims to describe biological complexity more accurately within the models (see Table 1).
2.1. Protein language models
The sequential nature of proteins has enabled the application of language modeling techniques from natural language processing. Early models such as ProtBERT,¹² MSA Transformer,¹³ and ProtTrans¹⁴ leveraged core techniques from deep language models while exploring variations in both input formats (e.g., single sequences and multiple sequence alignments (MSAs)) and architectures (e.g., unidirectional and BERT-style bidirectional encoders). ESMFold² achieves AlphaFold2-level accuracy in protein structure prediction without relying on MSAs, capturing contextual dependencies solely through language modeling. The scaling of model parameters and faster structure prediction highlight the potential of language models when trained on large-scale biological data. ProtMamba¹⁵ also showed that protein language modeling is feasible without MSAs. The model adopts a Mamba¹⁶-based state space architecture instead of an attention-based one to handle long sequences.
Protein design aims to generate proteins with completely new functions and structures, and generative models can play a key role in the process. ProGen¹⁷ enables controlled protein sequence generation by incorporating conditioning tags into an autoregressive transformer architecture. ProGen2¹⁸ and ProtGPT2¹⁹ further improve upon previous models by leveraging more complex conditioning tags to generate sequences that satisfy both structural and functional constraints. Recently, diffusion architectures, developed for image generation from text prompts, have been adapted for protein structure generation. RFdiffusion²⁰ incorporates spatial constraints through SE(3) equivariance, enabling more efficient and physically consistent sampling of protein structures. Such structural modeling has facilitated scaffolding tasks, and tools including ProteinMPNN²¹ and Foldseek²² have accelerated advances in protein design.
2.2. Protein structure models
Protein structure models predict the tertiary structures of proteins from their primary amino acid sequences. Traditionally, techniques such as X-ray crystallography, nuclear magnetic resonance (NMR) spectroscopy, and cryo-electron microscopy (cryo-EM) have been employed to elucidate protein structures. However, these experimental methods are often constrained by high costs, time requirements, and technical limitations, resulting in a considerably slower accumulation of structural data compared to the rapidly expanding number of known protein sequences.²³ This sequence-structure data imbalance (e.g., between UniProtKB²⁴ and the PDB⁷) underscores the need for computational prediction approaches to complement experimental efforts.
AlphaFold (AF)²⁵ and AlphaFold2 (AF2)¹⁰ have demonstrated outstanding performance in the field of protein structure prediction, as evidenced by their success in Critical Assessment of protein Structure Prediction 13 (CASP13) and CASP14, respectively. AF2 consists of two primary modules: the Evoformer and the structure module. Unlike AF, which employs a ResNet-based convolutional neural network (CNN), AF2 introduces an attention-based Evoformer, enabling efficient processing of MSAs and pairwise residue interactions. The Evoformer can be interpreted as a biology-specific transformer, where MSAs are treated as sequences in natural language, capturing evolutionary patterns across homologous proteins. This approach has been more fully realized in protein language models (pLMs), which are designed to replace MSAs by implicitly modeling evolutionary information. The structure module allows for end-to-end learning from primary sequence to 3D structural reconstruction, achieving near experimental accuracy.
Several platforms have been developed to extend the applicability and accessibility of protein structure models. ColabFold²⁶ leverages a metagenomic sequence database (ColabFoldDB) to enhance the diversity and quality of MSAs, and it is implemented to run on web-based GPU resources through Google Colaboratory. This approach improves accessibility to high-accuracy protein structure prediction while effectively reducing computational resource burdens. Phyre2.2²⁷ is an upgraded platform for protein structure and function prediction that maintains a user-friendly interface while integrating AlphaFold-predicted structures as new templates. It enables large-scale structural analysis by utilizing a broader range of structural templates beyond those available in the PDB. Furthermore, it supports domain-level optimization and batch-mode prediction, thereby serving as a computational alternative that complements experimental studies.
2.3. Nucleotide language models
Unlike natural language, DNA does not possess an inherent concept of "words," and its composition is limited to just four nucleotides, namely adenine (A), thymine (T), guanine (G), and cytosine (C), as opposed to protein sequences, which are composed of approximately 20 amino acids. This limited alphabet reduces the overall information density, making the development of effective DNA language models more challenging.
Earlier approaches, such as DeepSite,²⁸ utilized CNNs and recurrent neural networks (RNNs) for modeling DNA sequences. However, CNNs often struggle with capturing long-range dependencies, and RNNs suffer from computational inefficiency and scalability issues. To address these limitations, DNABERT²⁹ adopted masked language modeling (MLM) with a bidirectional encoder representations from transformers (BERT) architecture and k-mer tokenization (a.k.a. n-grams in computer science), enabling more effective sequence representation. Subsequent models, including GROVER³⁰ and DNABERT2,³¹ leveraged Byte Pair Encoding (BPE),³² a tokenization method employed by the SentencePiece³³ framework, to flexibly define token units. This helped reduce sequence information loss and improved computational efficiency. As a result, transformer-based models have been successfully applied to tasks such as identifying promoters and transcription factor binding sites (TFBSs) directly from DNA sequences. Caduceus³⁴ employs character-level (base-pair) tokenization, which ensures robustness to minor sequence variations. Furthermore, by modeling DNA sequences bidirectionally and incorporating reverse complement (RC) equivariance, Caduceus demonstrates superior performance on tasks such as regulatory site prediction and long-range SNP effect inference. Recently, research has moved beyond masked language modeling toward generative approaches, such as MegaDNA,³⁵ a transformer-based DNA sequence generation model.
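To make two of the ideas above concrete, the following is a minimal Python sketch of overlapping k-mer tokenization and of the reverse complement operation that RC-equivariant models are built around. The function names, the default k, and the stride are illustrative assumptions, not the preprocessing code of DNABERT or Caduceus.

```python
# Illustrative sketch only: names and defaults are assumptions,
# not taken from any published model's codebase.

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def kmer_tokenize(seq, k=3, stride=1):
    """Split a DNA string into k-mers; stride=1 gives overlapping k-mers."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def reverse_complement(seq):
    """Return the reverse complement strand of a DNA string (A<->T, C<->G)."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


if __name__ == "__main__":
    print(kmer_tokenize("ATGCGT", k=3))   # ['ATG', 'TGC', 'GCG', 'CGT']
    print(reverse_complement("ATGC"))     # GCAT
```

An RC-equivariant model is one whose output for `reverse_complement(seq)` is the correspondingly transformed output for `seq`, which reflects the fact that both strands describe the same double-stranded molecule.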
GenSLM³⁶ is an RNA language model capable of mutation effect prediction by capturing the differences between original and mutated RNA sequences and predicting their functional effects. The model uses a codon-level vocabulary, which avoids frameshift issues, for tokenizing RNA sequences. The study addresses input lengths that exceed the maximum capacity of the standard transformer. This limitation has been identified as a fundamental architectural bottleneck in early foundation models designed for nucleotide sequence analysis. Evo,³⁷ HyenaDNA,³⁸ and Caduceus³⁴ have adopted specialized architectures, such as Hyena³⁹ and Mamba, to support long-sequence modeling.
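The codon-level vocabulary mentioned above can be sketched as non-overlapping triplet tokenization. This is a hedged illustration under simple assumptions (the sequence starts in frame and trailing bases that do not fill a codon are dropped); the function name is hypothetical and this is not the GenSLM tokenizer itself.

```python
# Illustrative sketch only: assumes the input starts at the first base of a codon.

def codon_tokenize(seq):
    """Split a coding sequence into non-overlapping 3-base codons.

    Trailing bases that do not complete a codon are discarded.
    """
    seq = seq.upper()
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


if __name__ == "__main__":
    print(codon_tokenize("ATGGCCTAA"))    # ['ATG', 'GCC', 'TAA']
    # A single inserted base changes every downstream codon token:
    print(codon_tokenize("ATGAGCCTAA"))   # ['ATG', 'AGC', 'CTA']
```

Because each token corresponds to one codon, a token maps onto one amino acid position in the reading frame, which is the property a codon-preserving scheme is meant to keep.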
2.4. Single-cell language models
With the accumulation of high-dimensional gene expression data, single-cell language models have emerged as a new frontier in biology. While proteins and nucleotides are naturally sequential, single-cell gene expression data are not inherently sequential. Therefore, a method of ranking genes based on their expression levels has been proposed. Genes within a cell are treated as words in a sentence, and Transformer-based models are applied to capture their underlying dependencies, as in other biological language modeling tasks.
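The rank-based ordering described above can be sketched in a few lines of Python. This is a simplified illustration: it sorts genes by raw expression within one cell and drops unexpressed genes, whereas published models may first normalize counts; the function name, the `max_len` parameter, and the example values are assumptions.

```python
# Illustrative sketch only: a simplified rank encoding of one cell's expression profile.

def rank_encode(expression, max_len=None):
    """Turn a {gene: expression} mapping into a gene 'sentence'.

    Genes are ordered from highest to lowest expression; ties are broken
    alphabetically so the output is deterministic. Unexpressed genes are dropped.
    """
    expressed = [(gene, value) for gene, value in expression.items() if value > 0]
    expressed.sort(key=lambda gv: (-gv[1], gv[0]))
    tokens = [gene for gene, _ in expressed]
    return tokens if max_len is None else tokens[:max_len]


if __name__ == "__main__":
    cell = {"GAPDH": 120.0, "CD3E": 0.0, "ACTB": 95.5, "MS4A1": 3.0}
    print(rank_encode(cell))  # ['GAPDH', 'ACTB', 'MS4A1']
```

The resulting token list can then be fed to a Transformer in the same way as a sentence of words.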
Recent advances in single-cell representation learning have surpassed traditional marker gene-based approaches in capturing cellular heterogeneity.⁴⁰ scBERT⁴¹ addresses the limitations of marker gene-based approaches by leveraging full gene expression profiles, achieving strong performance in cell type annotation. Geneformer⁴² handles the non-sequential nature of gene expression data by ordering genes based on count statistics, also showing effectiveness in classification tasks. Building on this, scGPT⁴³ takes gene embeddings as input tokens and outputs a cell embedding, jointly learning representations at both levels. It achieves state-of-the-art results across tasks such as cell type classification, perturbation prediction, batch correction, and multi-omics integration. These findings emphasize the value of large-scale single-cell datasets (e.g., the Human Cell Atlas,⁴⁴ CellMarker⁴⁵) and the potential of embedding models to capture cellular complexity.
At the same time, approaches have been proposed to leverage general-purpose LLMs for directly incorporating prior biological knowledge, going beyond gene sequence modeling alone. For example, despite being trained on general natural-language text, GPT-4 has shown the ability to perform automatic cell type annotation based on text prompts describing gene expression levels.⁴⁶ Accordingly, GenePT⁴⁷ and scELMo⁴⁸ have constructed gene- and cell-level embeddings by applying text embedding APIs to a corpus of biomedical literature including the NCBI database. This approach has been reported to outperform some biological data-driven models such as Geneformer.⁴² In addition, CancerGPT,⁴⁹ a GPT-3⁵⁰ model fine-tuned on corpora of text, predicts drug response pairs within rare tissue types by aligning textual representations with cellular information. Developing disease-specific models with refined cell embeddings may further advance precision medicine.
2.5. Biomolecule representations
Biological macromolecules such as proteins and nucleic acids can be represented through diverse modalities to support machine learning applications. Sequence-based representations use amino acid or nucleotide strings and serve as the foundation for protein and genomic language models such as ESM,² ProtBERT,¹² and DNABERT.²⁹,³¹ Structural representations capture spatial information using atomic coordinates, contact maps, or distance matrices, which are leveraged in structure models like AF and ESMFold. Graph-based approaches abstract biomolecules into nodes and edges, enabling the use of geometric deep learning models such as SE(3) Transformer.⁵¹ Functional representations include Gene Ontology terms, protein family annotations, and subcellular localization, enriching models with biological context. At the cellular level, omics data like scRNA-seq are encoded as high-dimensional expression vectors.
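As an illustration of the structural representations listed above, the sketch below derives a distance matrix and a binary contact map from a list of 3D coordinates (for example, one representative atom per residue). The 8 Å cutoff is a commonly used convention assumed here for illustration, and the function names and toy coordinates are hypothetical.

```python
# Illustrative sketch only: pure-Python distance matrix and contact map.
import math


def distance_matrix(coords):
    """Pairwise Euclidean distances between 3D points."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(coords, cutoff=8.0):
    """Binary matrix with 1 where two points lie within `cutoff` (assumed 8 angstroms)."""
    return [[1 if d <= cutoff else 0 for d in row] for row in distance_matrix(coords)]


if __name__ == "__main__":
    # Three toy residue positions along the x-axis (units: angstroms).
    coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (20.0, 0.0, 0.0)]
    for row in contact_map(coords):
        print(row)
    # [1, 1, 0]
    # [1, 1, 0]
    # [0, 0, 1]
```

Both matrices are symmetric and invariant to rotation and translation of the input coordinates, which is one reason they are convenient inputs and training targets for structure models.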
2.6. Tokenization strategies
Tokenization methods have evolved from traditional machine learning techniques, including k-mer approaches,⁵² to biomolecule-specialized strategies such as structure- and codon-based tokenization,⁵³ which are critical for accurate and detailed biomolecular modeling. In protein and nucleotide models, k-mer tokenization (e.g., 3-mer, 6-mer) is used to capture local biochemical context, as seen in DNABERT and ProtBERT. Some models use byte-pair encoding (BPE) or unigram models trained on large corpora of sequences, such as DNABERT2, ESM, and ProGen. Codon-based or codon-preserving tokenization is also adopted to avoid frame-shift artifacts in nucleotide modeling. scBERT employs the gene2vec approach to generate gene embeddings, which facilitates the application of the BERT architecture to single-cell RNA sequencing data. These customized strategies ensure efficient representation of biological syntax and semantics in pretrained language models.
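To show how byte-pair encoding defines token units from data rather than from a fixed k, here is a minimal sketch of BPE merge learning on nucleotide strings. It is a toy version of the general algorithm, not the tokenizer of any model named above; the function names and the deterministic tie-breaking rule are assumptions.

```python
# Illustrative sketch only: toy BPE merge learning on character-level DNA tokens.
from collections import Counter


def merge_pair(tokens, pair):
    """Replace each non-overlapping occurrence of `pair` with one merged token."""
    merged, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == pair:
            merged.append(tokens[i] + tokens[i + 1])
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    return merged


def learn_bpe(sequences, num_merges):
    """Greedily learn up to `num_merges` merges from the most frequent adjacent pairs."""
    corpus = [list(seq) for seq in sequences]
    merges = []
    for _ in range(num_merges):
        counts = Counter()
        for tokens in corpus:
            counts.update(zip(tokens, tokens[1:]))
        if not counts:
            break
        # Ties are broken by the pair itself so that results are reproducible.
        best = max(counts, key=lambda p: (counts[p], p))
        merges.append(best)
        corpus = [merge_pair(tokens, best) for tokens in corpus]
    return merges, corpus


if __name__ == "__main__":
    merges, corpus = learn_bpe(["TATATA"], num_merges=1)
    print(merges)   # [('T', 'A')]
    print(corpus)   # [['TA', 'TA', 'TA']]
```

Frequent motifs end up as single tokens, so repetitive regions are covered by fewer tokens than with fixed-length k-mers.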
2.7. Application of BLMs in Biomedicine
2.7.1. Integrative modeling for molecular cell biology
AF2 demonstrated the strength of AI in protein structure prediction and has since inspired a wide range of follow-up studies. Models such as AlphaFold3,⁵⁴ RoseTTAFoldNA,⁵⁵ and RoseTTAFold All-Atom⁵⁶ extend their focus beyond proteins to include other biologically relevant molecules such as RNA, DNA, and ligands. In particular, all-atom structure prediction introduces computational challenges in accurately reconstructing 3D coordinates. This reflects a growing recognition that structural accuracy is essential for understanding biomolecular function not only in proteins, but also in RNA, where structure plays a critical role in regulatory activity.⁵⁷ Concurrently, large language model (LLM)-based methods have begun to incorporate structural information, moving beyond sequence modeling. ESM3⁵⁸ jointly embeds sequence, structure, and function, marking a transition toward multimodal representation. Specialized models such as ESM-DBP⁵⁹ have also been developed to predict DNA-binding proteins, adopting hybrid approaches that leverage both sequence and structure features. In the context of unified modeling in biological language models, foundation models aim to learn comprehensive cellular representations by integrating diverse biological modalities. These include epigenetic marks, spatial transcriptomics, protein expression data, and perturbation signatures, which can be explored to gain deeper insight into cellular function.⁶⁰ The integration signals a broader shift from modality-specific models toward unified representations that more reasonably reflect the inherent complexity of biological systems.
2.7.2. Multimodal foundation models
Multimodal Large Language Models (MLLMs) offer a framework for aligning heterogeneous data types such as clinical notes, protein sequences, and molecular structures.
BiomedGPT⁶¹ aligns natural language with biomedical modalities, particularly visual representations, to enable cross-modal reasoning for visual-language tasks. It focuses on applications such as diagnosis, summarization, and clinical decision support through flexible query answering. However, such models still exhibit limitations in reasoning across complex clinical scenarios, including the interpretation of radiological images and the resolution of textual conflicts. MediConfusion⁶² provides a diagnostic benchmark that systematically evaluates failure modes of multimodal medical LLMs.
Tx-LLM⁶³ leverages the advantages of large-scale pretraining on diverse biological datasets. Specifically, it is trained on sequence-level information encompassing RNA, DNA, and protein sequences, as well as SMILES. This comprehensive approach enables positive transfer in end-to-end drug discovery tasks, outperforming models that do not incorporate biological sequence data. Similarly, BioMedGPT-10B⁶⁴ contributes to drug discovery by specializing in protein and molecule question answering (QA), having been trained on cell sequences and on protein and molecule structures. These advancements highlight the potential of LLMs to serve as unified multimodal platforms in biomedicine (Fig. 2).
3. Chemical language models
Chemical Language Models (CLMs) have been proposed to learn the structure-activity relationship of small molecules from large-scale chemical data using various sequential representations of molecules, e.g., the Simplified Molecular Input Line Entry System (SMILES).⁶⁵
3.1. Model types
Similar to pLMs, most CLMs leverage Transformer architectures,⁶⁶ akin to those in natural language processing, to understand, generate, and manipulate chemical structures and reactions. These models are broadly categorized based on their architectural design, each optimized for distinct tasks within cheminformatics and drug discovery. The primary model types include encoder-only (BERT-like) models, decoder-only (GPT-like) models, and encoder-decoder architectures, as well as emerging multi-modal LLMs that integrate diverse data formats (Fig. 3). These architectural choices dictate how the models process molecular representations and perform tasks ranging from property prediction to de novo molecular design and retrosynthesis.
3.1.1. Chemical encoders
Encoder-only transformer models, primarily inspired by BERT, are designed to extract contextual representations of molecules and are well-suited for property prediction and molecular understanding. ChemBERTa⁶⁷ adapts the RoBERTa⁶⁸ framework with MLM and multitask regression, where auxiliary property prediction tasks are defined using molecular features computed by RDKit.⁶⁹ Mol-BERT⁷⁰ applies MLM to learn chemically informed token-level dependencies and is fine-tuned for tasks such as property classification and activity prediction. MoLFormer⁷¹ extends this approach using linear attention and rotary embeddings, yielding compact representations useful for downstream regression and classification tasks, though it is limited to relatively small molecules. Further encoder variants refine token representations or integrate structural priors. MolRoPE-BERT⁷² enhances positional encoding, while MFBERT,⁷³ SELFormer,⁷⁴ and semi-RoBERTa⁷⁵ introduce architectural modifications for greater chemical expressiveness. Graph-enhanced encoders like GROVER⁷⁶ incorporate topological features directly, bridging the gap between sequence and graph representations.
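The masked language modeling objective used by these encoders can be sketched on a SMILES string as follows. This is a simplified illustration: the regular expression is a reduced SMILES tokenizer that covers bracket atoms, Cl, Br, and two-digit ring closures only, and the masking routine always substitutes a mask token rather than following the mixed replace/keep scheme of the original BERT recipe. All names, the 15% mask rate, and the `[MASK]` symbol are assumptions for illustration.

```python
# Illustrative sketch only: simplified SMILES tokenization and MLM-style masking.
import random
import re

# Bracket atoms, two-letter halogens, two-digit ring closures, then any single character.
_SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")


def tokenize_smiles(smiles):
    """Split a SMILES string into tokens with a reduced regular expression."""
    return _SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, mask_rate=0.15, seed=0, mask_token="[MASK]"):
    """Hide a random subset of tokens; return the masked list and {position: original}."""
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = min(len(tokens), max(1, round(len(tokens) * mask_rate)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = mask_token
    return masked, labels


if __name__ == "__main__":
    tokens = tokenize_smiles("CC(=O)Cl")   # acetyl chloride
    print(tokens)                          # ['C', 'C', '(', '=', 'O', ')', 'Cl']
    masked, labels = mask_tokens(tokens)
    print(masked, labels)
```

During pretraining, the encoder receives `masked` and is trained to predict the entries of `labels` at the masked positions.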
3.1.2. Chemical decoders
Decoder-only transformer models, following the GPT architecture, are optimized for autoregressive generation and have become essential in de novo molecular design. MolGPT⁷⁷ relies on causal (left-to-right) modeling to learn token-wise dependencies and ultimately generates novel molecules. It supports conditional generation strategies to bias outputs toward specific chemical properties. GP-MoLFormer⁷⁸ is a decoder-only adaptation of MoLFormer-XL⁷¹ and is optimized for tasks such as unconstrained molecule generation, scaffold completion, and conditional property optimization. Other GPT-based chemical models include SMILES-GPT⁷⁹ and iupacGPT,⁸⁰ both adapted from GPT-2⁸¹ for molecular and nomenclature sequence generation. cMolGPT⁸² extends this framework for controllable generation under property or scaffold constraints. Taiga⁸³ combines GPT modeling with reinforcement learning to guide molecule synthesis toward multi-objective goals.
3.1