DeepSeek-OCR: Contexts Optical Compression (2025 technical report: an exploratory study of optically compressing long text)
Haoran Wei, Yaofeng Sun, Yukun Li
DeepSeek-AI

We present DeepSeek-OCR as an initial investigation into the feasibility of compressing long contexts via optical 2D mapping. DeepSeek-OCR consists of two components: DeepEncoder and DeepSeek3B-MoE-A570M as the decoder. Specifically, DeepEncoder serves as the core engine, designed to maintain low activations under high-resolution input while achieving high compression ratios to ensure an optimal and manageable number of vision tokens. Experiments show that when the number of text tokens is within 10 times the number of vision tokens (i.e., a compression ratio below 10×), the model achieves high OCR decoding precision (96%+).

Figure 1 | (a) Compression on the Fox [21] benchmark: OCR precision (%) versus compression ratio, where the compression ratio is the number of text tokens in the ground truth divided by the number of vision tokens the model used. (b) Performance on OmniDocBench [27]: overall performance (edit distance) versus average vision tokens per image. DeepSeek-OCR achieves state-of-the-art performance among end-to-end models while using the fewest vision tokens.

Contents
1 Introduction
2.1 Typical Vision Encoders in VLMs
2.2 End-to-end OCR Models
3 Methodology
3.1 Architecture
3.2.1 Architecture of DeepEncoder
3.2.2 Multiple resolution support
3.3 The MoE Decoder
3.4 Data Engine
3.4.3 General vision data
3.4.4 Text-only data
3.5 Training Pipelines
3.5.1 Training DeepEncoder
3.5.2 Training DeepSeek-OCR
4.1 Vision-text Compression Study
4.2 OCR Practical Performance
4.3 Qualitative Study
4.3.1 Deep parsing
4.3.2 Multilingual recognition
4.3.3 General vision understanding

1. Introduction

Current Large Language Models (LLMs) face significant computational challenges when processing long textual content due to quadratic scaling with sequence length. We explore a potential solution: leveraging the visual modality as an efficient compression medium for textual information. A single image containing document text can represent rich information using substantially fewer tokens than the equivalent digital text, suggesting that optical compression through vision tokens could achieve much higher compression ratios. This insight motivates us to reexamine vision-language models (VLMs) from an LLM-centric perspective, focusing on how vision encoders can enhance LLMs' efficiency in processing textual information rather than on basic VQA [12, 16, 24, 32, 41], a task humans already excel at. OCR, as an intermediate modality bridging vision and language, provides an ideal test bed for this vision-text compression paradigm: it establishes a natural compression-decompression mapping between visual and textual representations while offering quantitative evaluation metrics.

Accordingly, we present DeepSeek-OCR, a VLM designed as a preliminary proof of concept for efficient vision-text compression. Our work makes three primary contributions.

First, we provide a comprehensive quantitative analysis of vision-text token compression ratios. Our method achieves 96%+ OCR decoding precision at 9-10× text compression, ~90% at 10-12× compression, and ~60% at 20× compression on the Fox [21] benchmarks featuring diverse document layouts (with actual accuracy being even higher when accounting for formatting differences between output and ground truth), as shown in Figure 1(a). The results demonstrate that compact language models can effectively learn to decode compressed visual representations, suggesting that larger LLMs could readily acquire similar capabilities through appropriate pretraining design.

Second, we introduce DeepEncoder, a novel architecture that maintains low activation memory and minimal vision tokens even with high-resolution inputs. It serially connects window-attention and global-attention encoder components through a 16× convolutional compressor. This design ensures that the window-attention component handles the large number of raw vision tokens, while the compressor reduces the token count before the tokens enter the dense global-attention component, achieving effective memory and token compression.
Third, we develop DeepSeek-OCR based on DeepEncoder and DeepSeek3B-MoE [19, 20]. As shown in Figure 1(b), it achieves state-of-the-art performance among end-to-end models on OmniDocBench while using the fewest vision tokens. Additionally, we equip the model with capabilities for parsing charts, chemical formulas, simple geometric figures, and natural images to further enhance its practical utility. In production, DeepSeek-OCR can generate 33 million pages of training data per day for LLMs or VLMs using 20 nodes (each with 8 A100-40G GPUs).

In summary, this work presents a preliminary exploration of using the visual modality as an efficient compression medium for textual information processing in LLMs. Through DeepSeek-OCR, we demonstrate that vision-text compression can achieve significant token reduction (7-20×) for different historical context stages, offering a promising direction for addressing long-context challenges in large language models. Our quantitative analysis provides empirical guidelines for VLM token allocation optimization, while the proposed DeepEncoder architecture demonstrates practical feasibility with real-world deployment capability. Although focused on OCR as a proof of concept, this paradigm opens new possibilities for rethinking how vision and language modalities can be synergistically combined to enhance computational efficiency in large-scale text processing and agent systems.

Figure 2 | Typical vision encoders in popular VLMs. Three types of encoders are commonly used in current open-source VLMs, each with its own deficiencies: for example, dual preprocessing and deployment difficulty for dual-tower encoders (Vary / DeepSeek-VL), excessive fragmentation and too many vision tokens for tile-based encoders (InternVL series / DeepSeek-VL2), and large activations, long sequence lengths, and slow inference for NaViT-style adaptive-resolution encoders (Qwen2/2.5/3-VL series), whose token count is tokens = (w // 14(16)) × (h // 14(16)).

2.1. Typical Vision Encoders in VLMs

Current open-source VLMs employ three main types of vision encoders, as illustrated in Figure 2. The first type is a dual-tower architecture represented by Vary [36], which uses a parallel SAM [17] encoder to increase the visual vocabulary parameters for high-resolution image processing. While offering controllable parameters and activation memory, this approach has significant drawbacks: it requires dual image preprocessing, which complicates deployment and makes encoder pipeline parallelism challenging during training. The second type is the tile-based method exemplified by InternVL2.0 [8], which processes images by dividing them into small tiles for parallel computation, reducing activation memory under high-resolution settings. Although capable of handling extremely high resolutions, this approach is limited by its typically low native encoder resolution (below 512×512), which causes large images to be excessively fragmented and results in numerous vision tokens. The third type is adaptive-resolution encoding represented by Qwen2-VL [35], which adopts the NaViT [10] paradigm to process full images directly through patch-based segmentation without tile parallelization. While this encoder handles diverse resolutions flexibly, it faces substantial challenges with large images: activation memory consumption can cause GPU memory overflow, and sequence packing requires extremely long sequence lengths during training. Long vision-token sequences also slow down both the prefill and generation phases of inference.
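To make the token-count concern concrete, the sketch below illustrates how patch-token counts grow quadratically with resolution, using the tokens = (w // patch) × (h // patch) relation noted in Figure 2. It is only a rough illustration: real encoders differ in patch size (14 or 16) and in their merging and tiling strategies.

```python
# Rough illustration only: patch-token count for a NaViT-style encoder,
# following the tokens = (w // patch) x (h // patch) relation from Figure 2.

def patch_tokens(width: int, height: int, patch: int = 16) -> int:
    return (width // patch) * (height // patch)

for side in (384, 640, 1024, 2048):
    print(f"{side}x{side}: {patch_tokens(side, side)} tokens")
# 384x384: 576 | 640x640: 1600 | 1024x1024: 4096 | 2048x2048: 16384
```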
2.2. End-to-end OCR Models

OCR, particularly the document parsing task, has been a highly active topic in the image-to-text domain. With the advancement of VLMs, a large number of end-to-end OCR models have emerged, fundamentally transforming the traditional pipeline architecture (which required separate detection and recognition expert models) by simplifying OCR systems. Nougat [6] first employs an end-to-end framework for academic-paper OCR on arXiv, demonstrating the potential of such models for dense perception tasks. GOT-OCR2.0 [38] expands the scope of OCR 2.0 to include more synthetic image parsing tasks and designs an OCR model with performance-efficiency trade-offs, further highlighting the potential of end-to-end OCR research. Additionally, general vision models such as the Qwen-VL series [35], the InternVL series [8], and many of their derivatives continuously enhance their document OCR capabilities to explore the boundaries of dense visual perception. However, a crucial research question that current models leave open is: for a given amount of text, how many vision tokens are at least needed for decoding? This question holds significant importance for research into the principle that a document image can carry the same information in far fewer tokens than the equivalent digital text.

Figure 3 | The architecture of DeepSeek-OCR. DeepSeek-OCR consists of a DeepEncoder and a DeepSeek-3B-MoE decoder. DeepEncoder is the core of DeepSeek-OCR, comprising three components: a SAM [17] for perception dominated by window attention, a CLIP [29] for knowledge with dense global attention, and a 16× token compressor that bridges them.

3. Methodology

3.1. Architecture

As shown in Figure 3, DeepSeek-OCR has a unified end-to-end VLM architecture consisting of an encoder and a decoder. The encoder (namely DeepEncoder) is responsible for extracting image features and for tokenizing and compressing the visual representation. The decoder generates the required result from the image tokens and the prompt. DeepEncoder has approximately 380M parameters, mainly composed of an 80M SAM-base [17] and a 300M CLIP-large [29] connected in series. The decoder adopts a 3B MoE with about 570M activated parameters (DeepSeek-3B-MoE-A570M). In the following paragraphs, we delve into the model components, the data engine, and the training recipe.

3.2. DeepEncoder

To explore the feasibility of contexts optical compression, we need a vision encoder with the following features: 1. capable of processing high resolutions; 2. low activation at high resolutions; 3. few vision tokens; 4. support for multiple resolution inputs; 5. a moderate parameter count. However, as described in Section 2.1, current open-source encoders cannot fully satisfy all these conditions. Therefore, we design a novel vision encoder ourselves, named DeepEncoder.

3.2.1. Architecture of DeepEncoder

DeepEncoder mainly consists of two components: a visual perception feature-extraction component dominated by window attention, and a visual knowledge feature-extraction component with dense global attention. To benefit from the pretraining gains of previous works, we use SAM-base (patch size 16) and CLIP-large as the main architectures for the two components, respectively. For CLIP, we remove the first patch embedding layer, since its input is no longer an image but the output tokens of the previous stage. Between the two components, we borrow from Vary [36] and use a 2-layer convolutional module to perform 16× downsampling of vision tokens. Each convolutional layer has a kernel size of 3, a stride of 2, and padding of 1, and the channel count increases from 256 to 1024. Assuming we input a 1024×1024 image, DeepEncoder segments it into 1024/16 × 1024/16 = 4096 patch tokens. Since the first half of the encoder is dominated by window attention and has only 80M parameters, its activation is acceptable. Before entering global attention, the 4096 tokens pass through the compression module and the token count becomes 4096/16 = 256, keeping the overall activation memory controllable.
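The following is a minimal PyTorch sketch of the 16× token compressor described above, based only on the details given in the text (two convolutional layers, kernel 3, stride 2, padding 1, channels growing from 256 to 1024). The intermediate channel width, the activation function, and the reshaping between SAM's token grid and CLIP's input are assumptions for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class TokenCompressor16x(nn.Module):
    """Sketch of the 2-layer convolutional 16x token compressor.

    Each stride-2 convolution halves both spatial dimensions, so two of them
    reduce the token count by 4 x 4 = 16. Channel widths follow the paper's
    256 -> 1024 description; the intermediate 512 is an assumption.
    """

    def __init__(self, in_ch: int = 256, mid_ch: int = 512, out_ch: int = 1024):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, kernel_size=3, stride=2, padding=1),
            nn.GELU(),  # activation choice is an assumption
            nn.Conv2d(mid_ch, out_ch, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, N, C) from the SAM stage, with N = H*W patch tokens.
        b, n, c = tokens.shape
        hw = int(n ** 0.5)                       # assume a square token grid
        x = tokens.transpose(1, 2).reshape(b, c, hw, hw)
        x = self.net(x)                          # (B, out_ch, hw/4, hw/4)
        return x.flatten(2).transpose(1, 2)      # (B, N/16, out_ch)

# A 1024x1024 input at patch size 16 gives 4096 SAM tokens; after the compressor
# only 4096 / 16 = 256 tokens reach the global-attention (CLIP) stage.
x = torch.randn(1, 4096, 256)
print(TokenCompressor16x()(x).shape)  # torch.Size([1, 256, 1024])
```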
Figure 4 | To test model performance under different compression ratios (requiring different numbers of vision tokens) and to enhance the practicality of DeepSeek-OCR, we configure it with multiple resolution modes (Tiny, Small, Base, Large, Gundam, and Gundam-Master; see Table 1).

Table 1 | Multi-resolution support of DeepEncoder. For both research and application purposes, we design DeepEncoder with diverse native-resolution and dynamic-resolution modes.

  Mode       Type     Resolution                                    Processing         Vision tokens
  Tiny       native   512×512                                       resize             64
  Small      native   640×640                                       resize             100
  Base       native   1024×1024                                     padding            256
  Large      native   1280×1280                                     padding            400
  Gundam     dynamic  n × 640×640 tiles + 1024×1024 global view     resize + padding   n×100 + 256
  Gundam-M   dynamic  n × 1024×1024 tiles + 1280×1280 global view   resize + padding   n×256 + 400

3.2.2. Multiple resolution support

Suppose we have an image with 1000 optical characters and we want to test how many vision tokens are needed for decoding. This requires the model to support a variable number of vision tokens; that is, DeepEncoder needs to support multiple resolutions. We meet this requirement through dynamic interpolation of positional encodings and design several resolution modes that are trained simultaneously, so that a single DeepSeek-OCR model supports multiple resolutions.

As shown in Figure 4, DeepEncoder mainly supports two major input modes, native resolution and dynamic resolution, each of which contains several sub-modes. Native resolution supports four sub-modes: Tiny, Small, Base, and Large, with corresponding resolutions and token counts of 512×512 (64), 640×640 (100), 1024×1024 (256), and 1280×1280 (400), respectively. Since Tiny and Small modes use relatively small resolutions, images are processed by directly resizing the original shape to avoid wasting vision tokens. For Base and Large modes, images are padded to the corresponding size in order to preserve the original aspect ratio. After padding, the number of valid vision tokens is smaller than the actual number of vision tokens, with the calculation formula being:

    N_valid = ⌈ N_actual × (1 − (max(w, h) − min(w, h)) / max(w, h)) ⌉        (1)

where w and h are the width and height of the original input image.

Dynamic resolution is composed of two native resolutions. For example, Gundam mode consists of n × 640×640 tiles (local views) plus a 1024×1024 global view, with the tiling method following InternVL2.0 [8]. Supporting dynamic resolution is mainly an application consideration, especially for ultra-high-resolution inputs such as newspaper images. Tiling is a form of secondary window attention that can further reduce activation memory. It is worth noting that, because our native resolutions are relatively large, images are not fragmented too much under dynamic resolution (the number of tiles is kept within the range of 2 to 9). The number of vision tokens output by DeepEncoder under Gundam mode is n×100 + 256, where n is the number of tiles. For images with both width and height smaller than 640, n is set to 0, i.e., Gundam mode degrades to Base mode. Gundam mode is trained together with the four native-resolution modes to achieve the goal of one model supporting multiple resolutions. Note that Gundam-master mode (1024×1024 local views + 1280×1280 global view) is obtained through continued training of an already-trained DeepSeek-OCR model. This is mainly for load balancing: Gundam-master's resolution is so large that training it jointly would slow down the overall training speed.
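As a small sketch of the token accounting across modes, the snippet below combines Equation (1) with the Gundam formula; the mode settings mirror Table 1, and anything beyond what the text states (function names, the exact degradation handling) is illustrative only.

```python
import math

# (resolution, vision tokens, processed by padding?) per native mode, per Table 1.
NATIVE_MODES = {
    "Tiny":  (512,  64,  False),   # resized, so all tokens are valid
    "Small": (640,  100, False),
    "Base":  (1024, 256, True),    # padded, so Equation (1) applies
    "Large": (1280, 400, True),
}

def valid_tokens(n_actual: int, w: int, h: int) -> int:
    """Equation (1): valid tokens after aspect-preserving padding."""
    return math.ceil(n_actual * (1 - (max(w, h) - min(w, h)) / max(w, h)))

def native_mode_tokens(mode: str, w: int, h: int) -> tuple:
    """Return (actual, valid) vision tokens for a native-resolution mode."""
    _, n_actual, padded = NATIVE_MODES[mode]
    return n_actual, (valid_tokens(n_actual, w, h) if padded else n_actual)

def gundam_tokens(n_tiles: int) -> int:
    """Gundam mode: n x 640x640 tiles plus one 1024x1024 global view."""
    if n_tiles == 0:               # both sides < 640: degrade to Base mode
        return 256
    assert 2 <= n_tiles <= 9       # tile count is kept within [2, 9]
    return n_tiles * 100 + 256

# Example: a 1000 x 2000 page in Base mode and in Gundam mode with 4 tiles.
print(native_mode_tokens("Base", 1000, 2000))  # (256, 128)
print(gundam_tokens(4))                        # 656
```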
3.3. The MoE Decoder

Our decoder uses DeepSeekMoE [19, 20], specifically DeepSeek-3B-MoE. During inference, the model activates 6 out of 64 routed experts plus 2 shared experts, for about 570M activated parameters. The 3B DeepSeekMoE is well suited to domain-centric (in our case, OCR) VLM research, as it offers the expressive capability of a 3B model while enjoying the inference efficiency of a 500M-scale small model. The decoder reconstructs the original text representation from the compressed latent vision tokens produced by DeepEncoder:

    f_dec : R^(n×d_latent) → R^(N×d_text),   X̂ = f_dec(Z),   where n ≤ N        (2)

where Z ∈ R^(n×d_latent) are the compressed latent (vision) tokens from DeepEncoder and X̂ ∈ R^(N×d_text) is the reconstructed text representation. The function f_dec is a non-linear mapping that can be effectively learned by compact language models through OCR-style training. It is reasonable to conjecture that LLMs, through specialized pretraining optimization, would integrate such capabilities even more naturally.

3.4. Data Engine

We construct complex and diverse training data for DeepSeek-OCR, including: OCR 1.0 data, which mainly consists of traditional OCR tasks such as scene-image OCR and document OCR; OCR 2.0 data, which mainly includes parsing tasks for complex artificial images such as common charts, chemical formulas, and plane geometry; and general vision data, which is mainly used to inject some general image-understanding capability into DeepSeek-OCR and to preserve a general vision interface.

3.4.1. OCR 1.0 data

Document data is the top priority for DeepSeek-OCR. We collect 30M pages of diverse PDF data covering about 100 languages from the Internet, with Chinese and English accounting for approximately 25M pages and the other languages for 5M. For this data we create two types of ground truth: coarse annotations and fine annotations. Coarse annotations are extracted directly from the full dataset using fitz and are aimed at teaching the model to recognize optical text, especially in minority languages. Fine annotations include 2M pages each for Chinese and English, labeled using advanced layout models (such as PP-DocLayout [33]) and OCR models (such as MinerU [34] and GOT-OCR2.0 [38]) to construct interleaved detection-and-recognition data. For minority languages, we find in the detection part that the layout model generalizes reasonably well; in the recognition part, we use fitz to create small-patch data to train a GOT-OCR2.0 model, then use the trained model to label small patches after layout processing, employing a model flywheel to create 600K data samples. During the training of DeepSeek-OCR, coarse labels and fine labels are distinguished using different prompts. The ground-truth format for fine-annotation image-text pairs is shown in Figure 5.

Figure 5 | OCR 1.0 fine-annotation display: (a) ground-truth image; (b) fine annotations with layouts. We format the ground truth into an interleaved layout-and-text format, where each paragraph of text is preceded by its coordinates and label in the original image. All coordinates are normalized into 1000 bins.
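Below is a rough sketch of what the coarse-annotation step and the 1000-bin coordinate normalization could look like with fitz (PyMuPDF). The paper only says that coarse labels come from fitz and that fine-annotation coordinates are normalized into 1000 bins; the block-level label format here is illustrative, not the authors' actual format (their fine annotations come from dedicated layout and OCR models).

```python
import fitz  # PyMuPDF

def coarse_annotation(pdf_path: str, page_no: int = 0) -> str:
    """Coarse label: raw text extracted directly from the PDF text layer."""
    with fitz.open(pdf_path) as doc:
        return doc[page_no].get_text()

def block_annotations(pdf_path: str, page_no: int = 0, bins: int = 1000):
    """Illustrative interleaved layout+text label: each text block is preceded
    by its bounding box, normalized into `bins` coordinate bins."""
    with fitz.open(pdf_path) as doc:
        page = doc[page_no]
        w, h = page.rect.width, page.rect.height
        out = []
        for x0, y0, x1, y1, text, *_ in page.get_text("blocks"):
            box = [int(x0 / w * bins), int(y0 / h * bins),
                   int(x1 / w * bins), int(y1 / h * bins)]
            out.append((box, text.strip()))
        return out

# Example usage (hypothetical file name):
# print(coarse_annotation("sample.pdf"))
# for box, text in block_annotations("sample.pdf"):
#     print(box, text[:60])
```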
We also collect 3M pages of Word data, constructing high-quality image-text pairs without layout by directly extracting the document content. This data mainly benefits formulas and HTML-formatted tables. Additionally, we select some open-source data [28, 37] as supplements. For natural scene OCR, our model mainly supports Chinese and English. The image data comes from LAION [31] and Wukong [13], labeled using PaddleOCR [9], with 10M samples each for Chinese and English. As with document OCR, natural scene OCR can control whether to output detection boxes through prompts.

3.4.2. OCR 2.0 data

Following GOT-OCR2.0 [38], we refer to chart, chemical formula, and plane geometry parsing data as OCR 2.0 data. For chart data, following OneChart [7], we use pyecharts and matplotlib to render 10M images, mainly covering commonly used line, bar, pie, and composite charts. We define chart parsing as an image-to-HTML-table conversion task, as shown in Figure 6(a). For chemical formulas, we use the SMILES format from PubChem as the data source and render the formulas into images with RDKit, constructing 5M image-text pairs. For plane geometry images, we follow SlowPerception [39] for generation; specifically, we use a perception-ruler size of 4 to model each line segment. To increase the diversity of the rendered data, we introduce a translation-invariant data augmentation, in which the same geometric figure is translated within the image while the ground truth remains the figure drawn at the centered position of the coordinate system. Based on this, we construct a total of 1M plane geometry parsing samples, as illustrated in Figure 6(b).

Figure 6 | Image-text ground truth for (a) charts and (b) plane geometry. For charts, we do not use OneChart's [7] dictionary format; instead we use an HTML table format as the label, which saves a certain number of tokens. For plane geometry, we convert the ground truth into a dictionary format whose keys include line segments, endpoint coordinates, line-segment types, and so on, for better readability. Each line segment is encoded in the SlowPerception [39] manner.
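As a minimal sketch of the chart-data construction described above (render a chart, then pair the image with an HTML-table label), assuming nothing about the authors' actual pipeline beyond the tools (matplotlib) and the label format (HTML table) named in the text; the file name and table schema are illustrative.

```python
import random
import matplotlib
matplotlib.use("Agg")          # render off-screen
import matplotlib.pyplot as plt

def make_chart_sample(path: str) -> str:
    """Render a simple bar chart and return its HTML-table ground truth."""
    cats = ["A", "B", "C", "D"]
    vals = [round(random.uniform(1, 100), 1) for _ in cats]

    fig, ax = plt.subplots(figsize=(4, 3))
    ax.bar(cats, vals)
    ax.set_title("Sample bar chart")
    fig.savefig(path, dpi=150)
    plt.close(fig)

    rows = "".join(f"<tr><td>{c}</td><td>{v}</td></tr>" for c, v in zip(cats, vals))
    return f"<table><tr><th>category</th><th>value</th></tr>{rows}</table>"

label = make_chart_sample("chart_000.png")  # hypothetical output path
print(label)
```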
3.4.3. General vision data

DeepEncoder benefits from CLIP's pretraining gains and has sufficient parameters to incorporate general visual knowledge. Therefore, we also prepare some corresponding data for DeepSeek-OCR. Following DeepSeek-VL2 [40], we generate data for tasks such as captioning, detection, and grounding. Note that DeepSeek-OCR is not a general VLM, and this portion accounts for only 20% of the total data. We introduce this type of data mainly to preserve a general vision interface, so that researchers interested in our model and in general vision tasks can conveniently build on it in the future.

3.4.4. Text-only data

To ensure the model's language capability, we introduce 10% in-house text-only pretraining data, with all samples processed to a length of 8192 tokens, which is also the training sequence length of DeepSeek-OCR. In summary, when training DeepSeek-OCR, OCR data accounts for 70% of the mixture, general vision data for 20%, and text-only data for 10%.

3.5. Training Pipelines

Our training pipeline is very simple and consists of two main stages: a) training DeepEncoder independently; b) training DeepSeek-OCR. Note that the Gundam-master mode is obtained by continued training of a pre-trained DeepSeek-OCR model on 6M sampled data; since its training protocol is identical to the other modes, we omit the details hereafter.

3.5.1. Training DeepEncoder

Following Vary [36], we utilize a compact language model [15] and the next-token-prediction framework to train DeepEncoder. In this stage we use all the OCR 1.0 and 2.0 data mentioned above, as well as 100M general data samples from the LAION [31] dataset. All data is trained for 2 epochs with a batch size of 1280, using the AdamW [23] optimizer with a cosine-annealing scheduler [22] and a learning rate of 5e-5. The training sequence length is 4096.

3.5.2. Training DeepSeek-OCR

After DeepEncoder is ready, we use the data described in Section 3.4 to train DeepSeek-OCR, with the entire training process conducted on the HAI-LLM [14] platform. The model is trained with pipeline parallelism (PP) and is divided into 4 parts: DeepEncoder takes two and the decoder takes two. For DeepEncoder, we treat SAM and the compressor as a vision tokenizer, place them on PP0, and freeze their parameters, while treating the CLIP part as the input embedding layer and placing it on PP1 with unfrozen weights. For the language model, since DeepSeek-3B-MoE has 12 layers, we place 6 layers each on PP2 and PP3. We use 20 nodes (each with 8 A100-40G GPUs) for training, with data parallelism (DP) of 40 and a global batch size of 640. We use the AdamW optimizer with a step-based scheduler and an initial learning rate of 3e-5. The training speed is 90B tokens/day on text-only data and 70B tokens/day on multimodal data.

4.1. Vision-text Compression Study

Table 2 | We test DeepSeek-OCR's vision-text compression ratio using all English documents with 600-1300 tokens from the Fox [21] benchmarks. Text tokens are counted by tokenizing the ground-truth text with DeepSeek-OCR's tokenizer. Vision Tokens = 64 and 100 are the numbers of vision tokens output by DeepEncoder after resizing input images to 512×512 and 640×640, respectively.

We select the Fox [21] benchmarks to verify DeepSeek-OCR's compression and decompression capability on text-rich documents, in order to preliminarily explore the feasibility and the boundaries of contexts optical compression. We use the English-document portion of Fox, tokenize the ground-truth text with DeepSeek-OCR's tokenizer (vocabulary size approximately 129k), and select documents with 600-1300 tokens for testing, which happens to be 100 pages. Since the number of text tokens is not large, we only need to test performance in Tiny and Small modes, where Tiny mode corresponds to 64 vision tokens and Small mode to 100 vision tokens. We use the prompt without layout, "<image>\nFree OCR.", to control the model's output for these parsing tasks.

[Table: OCR performance comparison. All metrics in the table are edit distances, where smaller values indicate better performance. "Tokens" denotes the average number of vision tokens used per page, and "†200dpi" means using fitz to interpolate the original image to 200 dpi. For the DeepSeek-OCR model, the values in parentheses in the "Tokens" column are valid vision tokens computed according to Equation (1). Compared systems: pipeline models Dolphin [11], Marker [1], Mathpix [2], MinerU-2.1.1 [34], MonkeyOCR-1.2B [18], and PPstructure-v3 [9]; end-to-end models Nougat [6], SmolDocling [25], InternVL2-76B [8], Qwen2.5-VL-7B [5], OLMOCR [28], GOT-OCR2.0 [38], OCRFlux-3B [3], InternVL3-78B [42], Qwen2.5-VL-72B [5], dots.ocr [30], Gemini 2.5-Pro [4], MinerU 2.0 [34], and dots.ocr†200dpi [30]; and DeepSeek-OCR (end-to-end) in Tiny, Small, Large, Gundam, and Gundam-M†200dpi modes, each scored on overall/text/formula/table/order. Numeric entries are not included in this excerpt.]
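For reference, a small sketch of how the compression ratio and a precision number of the kind reported above can be computed from a model transcription and its ground truth; the normalized edit distance here is a generic implementation, not necessarily the Fox benchmark's exact metric script, and the token counts would come from DeepSeek-OCR's own ~129k-vocabulary tokenizer.

```python
def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def evaluate_page(prediction: str, ground_truth: str,
                  gt_text_tokens: int, vision_tokens: int) -> dict:
    """Compression ratio = ground-truth text tokens / vision tokens (Figure 1a);
    precision here is 1 - normalized edit distance (a generic stand-in)."""
    dist = levenshtein(prediction, ground_truth)
    norm = dist / max(len(ground_truth), 1)
    return {
        "compression_ratio": gt_text_tokens / vision_tokens,
        "edit_distance": norm,
        "precision": 1.0 - norm,
    }

# Example: a ~1000-token page decoded from 100 vision tokens (Small mode).
print(evaluate_page("the quick brown fox", "the quick brown fox",
                    gt_text_tokens=1000, vision_tokens=100))
```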
