A Survey on Vision-Language-Action Models for Autonomous Driving
Sicong Jiang*, Zilin Huang*, Kangan Qian*, Ziang Luo, Tianze Zhu, Yihong Tang, Menglin Kong and others
McGill University
Tsinghua University
Xiaomi Corporation
University of Wisconsin-Madison
University of Minnesota-Twin Cities
*Equal contribution.
Outline
1. Introduction: From End-to-End AD to VLA4AD
2. The VLA4AD Architecture
3. Progress of VLA4AD Models
4. Datasets & Benchmarks
5. Training & Evaluation
6. Challenges & Future Directions
7. Conclusion
1. From End-to-End AD to VLA4AD
(a) End-to-End Autonomous Driving
? One neural network maps raw sensors → steering/brake
? Removes hand-crafted perception & planning modules
? Pros:
  Simpler pipeline
  Holistic optimization
? Cons:
  Black-box, hard to audit
  Fragile on long-tail events
? No natural-language interface
  → difficult to explain or follow commands
Figure 1. Driving paradigms: End-to-End Models for Autonomous Driving
1. From End-to-End AD to VLA4AD
(b) Vision-Language Models for Autonomous Driving
? Fuse a vision encoder with an LLM
? Scene captioning, QA, high-level manoeuvre selection
? Pros:
  Zero-shot generalization to rare objects
  Human-readable explanations
? Cons:
  The action gap remains
  Latency & limited spatial awareness
  Risk of LLM hallucinations
? First step toward interactive, explainable driving systems
Figure 2. Driving paradigms: Vision-Language Models for Autonomous Driving
1. From End-to-End AD to VLA4AD
(c) Vision-Language-Action Models for Autonomous Driving
? Unified policy:
  multimodal encoder + language tokens + action head
? Outputs:
  driving trajectory/control + textual rationale
? Pros:
  Unified Vision-Language-Action system
  Enables free-form instruction following & CoT reasoning
  Human-readable explanations
  Improved robustness on corner cases
? Open Issues:
  Runtime gap
  Tri-modal data scarcity
? Demonstrates great potential for driving autonomous vehicles with human-level reasoning and clear explanations
Figure 3. Driving paradigms: Vision-Language-Action Models for Autonomous Driving
2. The VLA4AD Architecture
Input and Output Paradigm
Multimodal Inputs:
? Vision (Cameras):
  Capturing the dynamic scene.
? Sensors (LiDAR, Radar):
  Providing precise 3D structure and velocity.
? Language (Commands, QA):
  Defining high-level user intent.
Outputs:
? Control Actions (low-level):
  Direct steering/throttle signals.
? Plans (Trajectory):
  A sequence of future waypoints.
? Explanations (combined with the other outputs):
  Rationale for decisions.
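To make this input/output paradigm concrete, the following is a minimal Python sketch of a tri-modal interface; the field names and shapes (camera frames, LiDAR points, ego state, instruction, waypoints, rationale) are illustrative assumptions rather than the interface of any specific surveyed model.

# Minimal, hypothetical sketch of a VLA4AD input/output interface.
# Field names and shapes are illustrative assumptions only.
from dataclasses import dataclass
from typing import List, Optional, Tuple

import numpy as np

@dataclass
class DrivingObservation:
    camera_frames: List[np.ndarray]        # multi-view RGB images, e.g. 6 x (H, W, 3)
    lidar_points: Optional[np.ndarray]     # (N, 4) point cloud: x, y, z, intensity
    ego_state: Tuple[float, float, float]  # speed, steering angle, yaw rate
    instruction: str                       # e.g. "turn left at the next intersection"

@dataclass
class DrivingDecision:
    trajectory: List[Tuple[float, float]]  # future waypoints (x, y) in the ego frame
    control: Tuple[float, float, float]    # low-level steering, throttle, brake
    rationale: str                         # textual explanation of the decision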
2. The VLA4AD Architecture
Core Architectural Modules
Vision Encoder:
? Self-supervised backbones (DINOv2, CLIP).
? BEV projection & LiDAR fusion.
Action Decoder:
? Autoregressive action tokens & diffusion planners.
? Hierarchical controller: high-level plan → PID/MPC.
Language Processor:
? Task-specific fine-tuned LLaMA, Qwen, Vicuna, GPT and other large language base models.
? Model optimization methods, such as LoRA adapters, are often used to keep the model lightweight.
(A minimal sketch of how these modules fit together follows Figure 4.)
Figure 4. Architectural Paradigm of the VLA for Autonomous Driving Model
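As referenced above, here is a hedged PyTorch sketch of how the three modules could fit together in a single forward pass; the module choices, dimensions, and the simple regression action head are assumptions for illustration and do not reproduce any specific model from Table 1.

# Hedged sketch of a VLA4AD forward pass: vision features + language tokens -> action head.
# All module choices and dimensions below are illustrative assumptions.
import torch
import torch.nn as nn

class ToyVLAPolicy(nn.Module):
    def __init__(self, vision_dim=768, text_dim=768, hidden_dim=512, num_waypoints=6):
        super().__init__()
        self.num_waypoints = num_waypoints
        # Stand-ins for projections of a frozen DINOv2/CLIP backbone and an LLM text encoder.
        self.vision_proj = nn.Linear(vision_dim, hidden_dim)
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.fusion = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=hidden_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Action head regresses (x, y) waypoints; real systems may instead use
        # autoregressive action tokens or a diffusion planner.
        self.action_head = nn.Linear(hidden_dim, num_waypoints * 2)

    def forward(self, vision_tokens, text_tokens):
        # vision_tokens: (B, Nv, vision_dim) patch features from the vision encoder
        # text_tokens:   (B, Nt, text_dim) embeddings of the language command
        fused = torch.cat([self.vision_proj(vision_tokens),
                           self.text_proj(text_tokens)], dim=1)
        fused = self.fusion(fused)
        # Pool the fused sequence and decode a short future trajectory.
        traj = self.action_head(fused.mean(dim=1))
        return traj.view(-1, self.num_waypoints, 2)

policy = ToyVLAPolicy()
waypoints = policy(torch.randn(1, 196, 768), torch.randn(1, 16, 768))
print(waypoints.shape)  # torch.Size([1, 6, 2])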
3. Progress of VLA4AD Models
Key Stages of VLA Models for Autonomous Driving
Figure 5. The Progression from Passive Explainers to Active Reasoning Agents
3. Progress of VLA4AD Models
Representative VLA4AD Models
1. Pre-VLA: Language as Explainer (e.g., DriveGPT4 [1])
? Role: A frozen LLM provides post-hoc textual descriptions of the scene or intended maneuvers.
? Limitation: Language is a passive overlay, not integral to decision-making, leading to a semantic gap and potential hallucinations.
2. Modular VLA4AD (e.g., CoVLA-Agent [2], SafeAuto [3])
? Role: Language becomes an active, intermediate representation for planning, often validated by symbolic rules.
? Limitation: Multi-stage pipelines introduce latency and are prone to cascading failures at module boundaries.
Figure 6. DriveGPT4 (2024) - Interpretable LLM for AD
Figure 7. CoVLA-Agent (2025) - Trained with the CoVLA Dataset
[1] Xu, Zhenhua, et al. "DriveGPT4: Interpretable end-to-end autonomous driving via large language model." IEEE Robotics and Automation Letters (2024).
[2] Arai, Hidehisa, et al. "CoVLA: Comprehensive vision-language-action dataset for autonomous driving." 2025 IEEE/CVF WACV. IEEE, 2025.
[3] Zhang, Jiawei, et al. "SafeAuto: Knowledge-enhanced safe autonomous driving with multimodal foundation models." arXiv preprint arXiv:2503.00211 (2025).
3. Progress of VLA4AD Models
Representative VLA4AD Models
3. Unified End-to-End VLA (e.g., EMMA [1])
? Role: A single, unified network maps multimodal sensor inputs directly to control actions or trajectories.
? Limitation: While reactive, these models can struggle with long-horizon reasoning and complex, multi-step planning.
4. Reasoning-Augmented VLA4AD (e.g., ORION [2], AutoVLA [3])
? Role: Language models are central to the control loop, enabling long-term memory and Chain-of-Thought reasoning before acting.
? Status: These models show promising results in long-term reasoning and interaction, but inference delay remains a potential problem.
Figure 8. EMMA (2024) - End-to-End Multimodal Model for AD
Figure 9. AutoVLA (2025) - VLA4AD with RL & Adaptive CoT
[1] Hwang, Jyh-Jing, et al. "EMMA: End-to-end multimodal model for autonomous driving." TMLR (2025).
[2] Fu, H., Zhang, D., Zhao, Z., et al. "ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation." arXiv:2503.19755 (2025).
[3] Zhou, Z., Cai, T., Zhao, S. Z., et al. "AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning." arXiv:2506.13757 (2025).
3. Progress of VLA4AD Models
Mainstream VLA4AD Model Structures
Table 1. List of VLA4AD Models (2023-2025).
Sensor Inputs:
Single = single forward-facing camera input;
Multi = multi-view camera input;
State = vehicle state information & other sensor input.
Outputs:
LLC = low-level control; Traj. = future trajectory;
Multi. = multiple tasks such as perception, prediction, or planning.
4. Datasets & Benchmarks
High-quality and diverse datasets/benchmarks are the cornerstone of VLA research.
? Large-scale real-world data (e.g., nuScenes, BDD-X), providing rich multi-sensor information and human driving explanations.
? Key scenarios and safety tests (e.g., Impromptu VLA, Bench2Drive), focusing on the "long tail" and edge cases that are critical to safety.
? Fine-grained reasoning data (e.g., Reason2Drive, DriveLM), providing structured language annotations for training models' complex reasoning capabilities.
5. Training & Evaluation
Training Paradigm
1. Pre-train: vision + language backbones on image-text corpora.
2. Fine-tune: tri-modal imitation or supervised fine-tuning on (image, text, control) triples (a minimal sketch of this step follows Figure 10).
3. Augment: RL fine-tuning on corner cases.
Figure 10. Training Pipeline of VLA4AD
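As referenced in step 2, below is a minimal sketch of a single tri-modal supervised fine-tuning step; the batch fields, the L1 imitation loss, and the toy policy interface are assumptions for illustration, and stage 3 (RL on corner cases) is only indicated in a comment.

# Hedged sketch of stage-2 tri-modal supervised fine-tuning on (image, text, control) data.
# The batch fields and loss choice are illustrative assumptions.
import torch
import torch.nn.functional as F

def sft_step(policy, batch, optimizer):
    """One supervised fine-tuning step imitating expert trajectories."""
    optimizer.zero_grad()
    # Predict waypoints from camera features and the tokenized language command.
    pred_traj = policy(batch["vision_tokens"], batch["text_tokens"])
    # Imitation loss against the recorded expert trajectory; a rationale/text loss
    # could be added here, and stage 3 would further fine-tune with RL on corner cases.
    loss = F.l1_loss(pred_traj, batch["expert_trajectory"])
    loss.backward()
    optimizer.step()
    return loss.item()

In a hypothetical pipeline, this step would be looped over a tri-modal dataloader after the image-text pre-training stage.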
5. Training & Evaluation
Evaluation Metrics
Driving Metrics:
Closed-loop success rate, infractions, latency.
Language Metrics:
BLEU/CIDEr (NuInteract [1]), CoT consistency (Reason2Drive [2]).
Robustness Stressors:
Sensor noise, adversarial prompts, OOD weather (DriveBench [3]).
(A metric-aggregation sketch follows the references below.)
Figure 11. Illustration of the DriveBench (2025) Benchmark
[1] Zhao, Zongchuang, et al. "Extending Large Vision-Language Model for Diverse Interactive Tasks in Autonomous Driving." arXiv preprint arXiv:2505.08725 (2025).
[2] Nie, Ming, et al. "Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving." ECCV 2024.
[3] Xie, S., Kong, L., Dong, Y., et al. "Are VLMs Ready for Autonomous Driving? An Empirical Study from the Reliability, Data, and Metric Perspectives." arXiv:2501.04003 (2025).
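To illustrate how such metrics might be aggregated in practice, here is a hedged sketch that computes a closed-loop success rate, infractions, and latency from hypothetical episode logs, plus a BLEU score for generated rationales via NLTK; the log format is a made-up assumption, and benchmarks such as DriveBench define their own protocols.

# Hedged sketch of aggregating driving and language metrics from episode logs.
# The episode-log format is a made-up assumption; real benchmarks define their own.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

episodes = [
    {"success": True,  "infractions": 0, "latency_ms": 45.0},
    {"success": False, "infractions": 2, "latency_ms": 61.0},
]

success_rate = sum(e["success"] for e in episodes) / len(episodes)
infractions_per_episode = sum(e["infractions"] for e in episodes) / len(episodes)
mean_latency_ms = sum(e["latency_ms"] for e in episodes) / len(episodes)

# Language quality: BLEU between a generated rationale and a reference caption.
reference = ["slowing", "down", "because", "a", "pedestrian", "is", "crossing"]
candidate = ["slowing", "down", "for", "a", "crossing", "pedestrian"]
bleu = sentence_bleu([reference], candidate, smoothing_function=SmoothingFunction().method1)

print(f"success={success_rate:.2f}, infractions/ep={infractions_per_episode:.1f}, "
      f"latency={mean_latency_ms:.0f} ms, BLEU={bleu:.3f}")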
6. Challenges & Future Directions
Open Challenges
? Robustness & Reliability:
  How to counter LLM hallucinations and ensure stability under sensor corruption and linguistic noise?
? Real-time Performance:
  How to execute billion-parameter models on automotive hardware at ≥ 30 Hz? (quantization, distillation, MoE; a quantization sketch follows this list).
? Data Bottlenecks:
  The scarcity of high-quality, large-scale tri-modal (Vision + Language + Action) data is a major hurdle.
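To illustrate one of the mitigation options named in the Real-time Performance bullet, here is a hedged sketch of post-training dynamic INT8 quantization of a toy policy's linear layers in PyTorch; this is a generic latency-reduction technique, not the deployment recipe of any surveyed model, and distillation and MoE are not shown.

# Hedged sketch: dynamic INT8 quantization of a toy policy's linear layers in PyTorch.
# The toy model and timing loop are illustrative assumptions.
import time
import torch
import torch.nn as nn

toy_policy = nn.Sequential(nn.Linear(1024, 1024), nn.ReLU(), nn.Linear(1024, 12))

quantized_policy = torch.quantization.quantize_dynamic(
    toy_policy, {nn.Linear}, dtype=torch.qint8  # quantize only the Linear weights
)

x = torch.randn(1, 1024)
start = time.perf_counter()
for _ in range(100):
    quantized_policy(x)
elapsed_ms = (time.perf_counter() - start) * 1000 / 100
print(f"mean latency per forward pass: {elapsed_ms:.2f} ms")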
Future Directions
? Foundation Driving Models:
  A self-supervised, multi-sensor "driving backbone" (GPT-style) trained on petabytes of driving data.
? Neuro-Symbolic Safety:
  Hybrid systems that combine neural flexibility with a verifiable logical safety kernel.
? Fleet-scale Continual Learning:
  Vehicles sharing knowledge (e.g., concise language snippets).