
A Survey on Vision-Language-Action Models for Autonomous Driving

Sicong Jiang*, Zilin Huang*, Kangan Qian*, Ziang Luo, Tianze Zhu, Yihong Tang, Menglin Kong, and others

McGill University
Tsinghua University
Xiaomi Corporation
University of Wisconsin-Madison
University of Minnesota-Twin Cities

*Equal contribution.


Outline

1. Introduction: From End-to-End AD to VLA4AD
2. The VLA4AD Architecture
3. Progress of VLA4AD Models
4. Datasets & Benchmarks
5. Training & Evaluation
6. Challenges & Future Directions
7. Conclusion


1. From End-to-End AD to VLA4AD

(a) End-to-End Autonomous Driving

• One neural network maps raw sensors → steering/brake
• Removes hand-crafted perception & planning modules
• Pros:
  - Simpler pipeline
  - Holistic optimization
• Cons:
  - Black box, hard to audit
  - Fragile on long-tail events
• No natural-language interface → difficult to explain or follow commands

Figure 1. Driving paradigms: End-to-End Models for Autonomous Driving
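To make the paradigm concrete, here is a minimal PyTorch sketch of such a policy; the architecture, layer sizes, and the name EndToEndPolicy are illustrative assumptions, not a model from the survey.

    import torch
    import torch.nn as nn

    class EndToEndPolicy(nn.Module):
        # Hypothetical single-network policy: raw camera frame -> [steering, brake].
        def __init__(self):
            super().__init__()
            self.backbone = nn.Sequential(            # tiny CNN feature extractor
                nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
                nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            )
            self.head = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 2))

        def forward(self, image):                     # image: (B, 3, H, W)
            return self.head(self.backbone(image))    # (B, 2): steering, brake

    policy = EndToEndPolicy()
    controls = policy(torch.randn(1, 3, 224, 224))    # one synthetic frame

Note how the black-box criticism follows directly: the two output numbers carry no intermediate representation that a human or a verifier could inspect.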


1. From End-to-End AD to VLA4AD

(b) Vision-Language Models for Autonomous Driving

• Fuse a vision encoder with an LLM
• Tasks: scene captioning, QA, high-level maneuver selection
• Pros:
  - Zero-shot generalization to rare objects
  - Human-readable explanations
• Cons:
  - The action gap remains
  - Latency & limited spatial awareness
  - Risk of LLM hallucinations
• First step toward interactive, explainable driving systems

Figure 2. Driving paradigms: Vision-Language Models for Autonomous Driving
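A hedged sketch of the fusion idea common to this family (the projector design, dimensions, and names are assumptions): patch features from a frozen vision encoder are projected into the LLM's token-embedding space and prepended to the text tokens.

    import torch
    import torch.nn as nn

    class VisionLanguageFusion(nn.Module):
        # Hypothetical: map vision-encoder patches into the LLM embedding space.
        def __init__(self, vis_dim=768, llm_dim=4096):
            super().__init__()
            self.projector = nn.Linear(vis_dim, llm_dim)  # often a small MLP in practice

        def forward(self, patch_feats, text_embeds):
            # patch_feats: (B, N_patches, vis_dim) from e.g. a CLIP/DINOv2 encoder
            # text_embeds: (B, N_tokens, llm_dim) from the LLM's embedding table
            vis_tokens = self.projector(patch_feats)
            # Prepend visual tokens; the LLM decodes captions / QA answers over the sequence.
            return torch.cat([vis_tokens, text_embeds], dim=1)

    fusion = VisionLanguageFusion()
    seq = fusion(torch.randn(1, 196, 768), torch.randn(1, 12, 4096))  # (1, 208, 4096)

The LLM then decodes text over this combined sequence, which is exactly where the action gap arises: the output is language, not control.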


1. From End-to-End AD to VLA4AD

(c) Vision-Language-Action Models for Autonomous Driving

• Unified policy: multimodal encoder + language tokens + action head
• Outputs: driving trajectory/control + textual rationale
• Pros:
  - Unified Vision-Language-Action system
  - Enables free-form instruction following & CoT reasoning
  - Human-readable explanations
  - Improved robustness on corner cases
• Open issues:
  - Runtime gap
  - Tri-modal data scarcity
• Demonstrates great potential for driving autonomous vehicles with human-level reasoning and clear explanations

Figure 3. Driving paradigms: Vision-Language-Action Models for Autonomous Driving
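A minimal sketch of the unified-policy idea (module names, sizes, and the crude mean-pooling are illustrative assumptions): one transformer over fused vision-language tokens feeds both an action head and a rationale head.

    import torch
    import torch.nn as nn

    class VLAPolicy(nn.Module):
        # Hypothetical unified VLA policy: fused tokens -> waypoints + rationale logits.
        def __init__(self, d=512, vocab=32000, horizon=8):
            super().__init__()
            enc_layer = nn.TransformerEncoderLayer(d, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=2)
            self.horizon = horizon
            self.action_head = nn.Linear(d, horizon * 2)  # (x, y) per future waypoint
            self.text_head = nn.Linear(d, vocab)          # rationale token logits

        def forward(self, tokens):                  # tokens: fused vision+language (B, N, d)
            h = self.encoder(tokens)
            pooled = h.mean(dim=1)                  # crude pooling for the sketch
            waypoints = self.action_head(pooled).view(-1, self.horizon, 2)
            rationale_logits = self.text_head(h)    # decoded into a textual explanation
            return waypoints, rationale_logits

    policy = VLAPolicy()
    waypoints, rationale = policy(torch.randn(1, 64, 512))

The key difference from (a) and (b) is that one set of weights produces both the trajectory and its textual justification.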


2. The VLA4AD Architecture

Input and Output Paradigm

Multimodal Inputs:
• Vision (cameras): capturing the dynamic scene.
• Sensors (LiDAR, radar): providing precise 3D structure and velocity.
• Language (commands, QA): defining high-level user intent.

Outputs:
• Control action (low-level): direct steering/throttle signals.
• Plans (trajectory): a sequence of future waypoints.
• Explanations (combined with other actions): rationale for decisions.

(Diagram panel labels: Object Detection, Lane Detection, Occupancy; Steering Control, Brake Control, Planning Trajectory.)
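This I/O contract can be written down as a small data structure; a sketch with hypothetical field names and types:

    from dataclasses import dataclass
    from typing import List, Optional, Tuple
    import numpy as np

    @dataclass
    class VLAInput:
        camera_frames: List[np.ndarray]        # one RGB array per camera view
        lidar_points: Optional[np.ndarray]     # (N, 4): x, y, z, intensity
        radar_tracks: Optional[np.ndarray]     # per-object position & velocity
        command: str                           # e.g., "turn left at the next intersection"

    @dataclass
    class VLAOutput:
        control: Optional[Tuple[float, float]]           # low-level (steering, throttle/brake)
        trajectory: Optional[List[Tuple[float, float]]]  # future (x, y) waypoints
        explanation: str                                 # textual rationale for the decision

Either control or trajectory may be populated, depending on whether a given model emits low-level actions or plans.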

2. The VLA4AD Architecture

Core Architectural Modules

Vision Encoder:
• Self-supervised backbones (DINOv2, CLIP).
• BEV projection & LiDAR fusion.

Action Decoder:
• Autoregressive action tokens & diffusion planners.
• Hierarchical controller: high-level intent → low-level PID/MPC.

Language Processor:
• Task-specific fine-tuned LLaMA, Qwen, Vicuna, GPT, and other large language base models.
• Model optimization methods such as LoRA adapters are often used to keep the model lightweight (sketched after Figure 4).


Figure 4. Architectural Paradigm of the VLA for Autonomous Driving Model
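A minimal sketch of the LoRA idea mentioned above (the rank, scaling, and class name are illustrative assumptions): a frozen pretrained linear layer plus a trainable low-rank update.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        # Frozen base weight W plus trainable low-rank update B @ A (rank r << d).
        def __init__(self, base: nn.Linear, r=8, alpha=16):
            super().__init__()
            self.base = base
            for p in self.base.parameters():          # freeze the pretrained layer
                p.requires_grad_(False)
            self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, r))  # zero-init: update starts at 0
            self.scale = alpha / r

        def forward(self, x):
            return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)

    layer = LoRALinear(nn.Linear(4096, 4096))
    y = layer(torch.randn(2, 4096))

Only A and B, a small fraction of the parameters, receive gradients, which is what makes fine-tuning the language processor on driving data affordable.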


3. Progress of VLA4AD Models

Key Stages of VLA Models for Autonomous Driving

Figure 5. The Progression from Passive Explainers to Active Reasoning Agents

3. Progress of VLA4AD Models

Representative VLA4AD Models

1. Pre-VLA: Language as Explainer (e.g., DriveGPT4 [1])
• Role: A frozen LLM provides post-hoc textual descriptions of the scene or intended maneuvers.
• Limitation: Language is a passive overlay, not integral to decision-making, leading to a semantic gap and potential hallucinations.

2. Modular VLA4AD (e.g., CoVLA-Agent [2], SafeAuto [3])
• Role: Language becomes an active, intermediate representation for planning, often validated by symbolic rules.
• Limitation: Multi-stage pipelines introduce latency and are prone to cascading failures at module boundaries.

Figure 6. DriveGPT4 (2024): Interpretable LLM for AD
Figure 7. CoVLA-Agent (2025), Trained with the CoVLA Dataset

[1] Xu, Zhenhua, et al. "DriveGPT4: Interpretable end-to-end autonomous driving via large language model." IEEE Robotics and Automation Letters (2024).
[2] Arai, Hidehisa, et al. "CoVLA: Comprehensive vision-language-action dataset for autonomous driving." 2025 IEEE/CVF WACV. IEEE, 2025.
[3] Zhang, Jiawei, et al. "SafeAuto: Knowledge-enhanced safe autonomous driving with multimodal foundation models." arXiv preprint arXiv:2503.00211 (2025).

3. Progress of VLA4AD Models

Representative VLA4AD Models

3. Unified End-to-End VLA (e.g., EMMA [1])
• Role: A single, unified network maps multimodal sensor inputs directly to control actions or trajectories.
• Limitation: While reactive, these models can struggle with long-horizon reasoning and complex, multi-step planning.

4. Reasoning-Augmented VLA4AD (e.g., ORION [2], AutoVLA [3])
• Role: Language models are central to the control loop, enabling long-term memory and Chain-of-Thought reasoning before acting.
• Status: Promising results in long-term reasoning and interaction, but inference delay remains a potential problem.

Figure 8. EMMA (2024): End-to-End Multimodal Model for AD
Figure 9. AutoVLA (2025): VLA4AD with RL & Adaptive CoT

[1] Hwang, Jyh-Jing, et al. "EMMA: End-to-end multimodal model for autonomous driving." TMLR (2025).
[2] Fu, H., Zhang, D., Zhao, Z., et al. "ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation." arXiv preprint arXiv:2503.19755 (2025).
[3] Zhou, Z., Cai, T., Zhao, S. Z., et al. "AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning." arXiv preprint arXiv:2506.13757 (2025).

3. Progress of VLA4AD Models

Mainstream VLA4AD Model Structures

Table 1. List of VLA4AD Models (2023-2025).

Sensor inputs:
• Single = single forward-facing camera input;
• Multi = multi-view camera input;
• State = vehicle state information & other sensor input.

Outputs:
• LLC = low-level control;
• Traj. = future trajectory;
• Multi. = multiple tasks such as perception, prediction, or planning.

4. Datasets & Benchmarks

High-quality and diverse datasets/benchmarks are the cornerstone of VLA research.

• Large-scale real-world data (e.g., nuScenes, BDD-X), providing rich multi-sensor information and human driving explanations.
• Key scenarios and safety tests (e.g., Impromptu VLA, Bench2Drive), focusing on the long-tail and edge cases that are critical to safety.
• Fine-grained reasoning data (e.g., Reason2Drive, DriveLM), providing structured language annotations for training models' complex reasoning capabilities.

5. Training & Evaluation

Training Paradigm

1. Pre-train: vision + language backbones on image-text corpora.
2. Fine-tune: tri-modal imitation / supervised fine-tuning on (image, text, control) triples (a single step is sketched below).
3. Augment: RL fine-tuning on corner cases.

Figure 10. Training Pipeline of VLA4AD
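As a concrete illustration of stage 2, here is one supervised fine-tuning step; the model's forward signature and the 0.5 loss weight are assumptions for the sketch, not values from the survey.

    import torch
    import torch.nn as nn

    def sft_step(model, optimizer, image, text_ids, expert_traj):
        # Assumed VLA forward pass: returns predicted waypoints + rationale logits.
        pred_traj, text_logits = model(image, text_ids)
        traj_loss = nn.functional.l1_loss(pred_traj, expert_traj)   # imitate expert waypoints
        lm_loss = nn.functional.cross_entropy(                      # next-token rationale loss
            text_logits[:, :-1].reshape(-1, text_logits.size(-1)),
            text_ids[:, 1:].reshape(-1),
        )
        loss = traj_loss + 0.5 * lm_loss        # weighting is a free hyperparameter
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()

The single scalar loss over both heads is what makes the training tri-modal: vision in, language and action jointly supervised.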

5. Training & Evaluation

Evaluation Metrics

Driving metrics:
• Closed-loop success rate, infractions, latency.

Language metrics:
• BLEU/CIDEr (NuInteract [1]), CoT consistency (Reason2Drive [2]).

Robustness stressors:
• Sensor noise, adversarial prompts, OOD weather (DriveBench [3]).

Figure 11. Illustration of the DriveBench (2025) Benchmark
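A toy sketch of how the driving metrics might be aggregated from closed-loop episode logs (the log schema here is hypothetical, not a benchmark format):

    import statistics

    # One dict per closed-loop evaluation episode (hypothetical schema).
    episodes = [
        {"completed": True,  "infractions": 0, "km": 1.2, "latency_ms": [31, 29, 35]},
        {"completed": False, "infractions": 2, "km": 0.4, "latency_ms": [30, 58]},
    ]

    success_rate = sum(e["completed"] for e in episodes) / len(episodes)
    infractions_per_km = sum(e["infractions"] for e in episodes) / sum(e["km"] for e in episodes)
    mean_latency = statistics.mean(l for e in episodes for l in e["latency_ms"])

    print(f"success={success_rate:.0%}  infractions/km={infractions_per_km:.2f}  "
          f"mean latency={mean_latency:.1f} ms")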

[1] Zhao, Zongchuang, et al. "Extending large vision-language model for diverse interactive tasks in autonomous driving." arXiv preprint arXiv:2505.08725 (2025).
[2] Nie, Ming, et al. "Reason2Drive: Towards interpretable and chain-based reasoning for autonomous driving." ECCV 2024.
[3] Xie, S., Kong, L., Dong, Y., et al. "Are VLMs ready for autonomous driving? An empirical study from the reliability, data, and metric perspectives." arXiv preprint arXiv:2501.04003 (2025).


6. Challenges & Future Directions

Open Challenges

• Robustness & reliability: How to counter LLM hallucinations and ensure stability under sensor corruption and linguistic noise?
• Real-time performance: How to execute billion-parameter models on automotive hardware at ≥30 Hz? (quantization, distillation, MoE; a quantization sketch follows this list).
• Data bottlenecks: The scarcity of high-quality, large-scale tri-modal (vision + language + action) data is a major hurdle.
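For the real-time challenge, post-training quantization is one of the levers listed above; a minimal PyTorch dynamic-quantization sketch on a toy MLP (not an actual VLA model):

    import torch
    import torch.nn as nn

    model = nn.Sequential(nn.Linear(4096, 4096), nn.ReLU(), nn.Linear(4096, 4096))

    # Dynamic quantization: weights stored in int8, activations quantized on the fly.
    quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)

    with torch.no_grad():
        y = quantized(torch.randn(1, 4096))   # ~4x smaller Linear weights, int8 matmuls on CPU

Distillation and MoE routing attack the same latency budget from the architecture side rather than the numerics side.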

Future Directions

• Foundation driving models: A self-supervised, multi-sensor "driving backbone" (GPT-style) trained on petabytes of driving data.
• Neuro-symbolic safety: Hybrid systems that combine neural flexibility with a verifiable logical safety kernel.
• Fleet-scale continual learning: Vehicles sharing knowledge across the fleet (e.g., concise language snippets).
