Speaker: Huiqiang Jiang (姜慧強)
Research SDE, Microsoft Research Asia (Shanghai)
System-Algorithm Co-design: efficient methods to accelerate inference/training

Agenda:
01 Applications and Inference Challenges of Long-Context LLMs
02 Current Mainstream Inference Optimization Methods and Techniques
03 KV-Cache-Centric LLM Inference Architecture
04 KV-Cache-Centric Efficient Long-Context Methods
05 Summary and Outlook

Applications that demand long context:
- Extended meeting time
- Massive pages of docs
- Lifelong personalization
- Endless agentic history
- Lengthy codebases
- Complex reasoning

10M tokens ≈ the PyTorch repository code ≈ the Lord of the Rings trilogy (1ps) ≈ 500 reasoning iterations*

- Almost all latest models can process contexts exceeding 100K tokens. (https://lifearchitect.ai/models/#context-windows)
- DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
Two key inference challenges:
- Long prefilling latency: ~30 minutes to process 1M tokens on an A100 for an 8B LLM.
- Large GPU memory consumption: 62 GB of GPU memory is required for 512K tokens in fp16.

Large GPU memory consumption => RetrievalAttention
Long prefilling latency => MInference

Diagram: KV-cache lifecycle — Prefill (Keys & Values, Prefix Caching) → KV Cache Storage → Decode (token generation).

Our work along this lifecycle:
- MInference 1.0 / MMInference: dynamic sparse prefilling
- RetrievalAttention: alignment between ANNS and attention
- SCBench: exploring the bound of KV caching
- LLMLingua series: prompt compression (sparse attention, compressed prompts)

02 Current Mainstream Inference Optimization Methods and Techniques
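The 62 GB figure can be reproduced from the model shape. A minimal back-of-envelope sketch, assuming a Llama-3-8B-style GQA configuration (32 layers, 8 KV heads, head dim 128, fp16) — these shape parameters are my assumption, not stated on the slide:

```python
# Back-of-envelope KV-cache size for a GQA model:
# 2 (K and V) x layers x kv_heads x head_dim x dtype bytes, per token.
def kv_cache_bytes(tokens, layers=32, kv_heads=8, head_dim=128, dtype_bytes=2):
    per_token = 2 * layers * kv_heads * head_dim * dtype_bytes  # 128 KiB here
    return tokens * per_token

print(kv_cache_bytes(512_000) / 2**30)  # ~62.5 GiB for 512K tokens in fp16
```

With these assumed shapes, 512K tokens come out to ~62.5 GiB, consistent with the slide's 62 GB figure.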
Prompt Caching / Context Caching:
(a) Prefix caching is widely used in LLM frameworks (RadixAttention, Automatic Prefix Caching). (b) Prefix caching is widely used in LLM APIs.

03 KV-Cache-Centric LLM Inference Architecture

Shared-context scenarios: repo-level code debugging, long-document QA, multi-turn dialogue, self-play reasoning (RadixAttention, Automatic Prefix Caching).
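The prefix caching used in these scenarios can be sketched as a lookup keyed on token prefixes: a new request reuses the KV state of its longest cached prefix and only prefills the remainder. A toy illustration with hypothetical names — production systems such as vLLM's Automatic Prefix Caching hash fixed-size token blocks, and SGLang's RadixAttention uses a radix tree rather than this linear scan:

```python
# Toy prefix cache: store opaque "KV states" for token prefixes and
# reuse the longest cached prefix on a new request.
class PrefixCache:
    def __init__(self):
        self.cache = {}  # token-tuple prefix -> opaque KV state

    def put(self, tokens, kv_state):
        self.cache[tuple(tokens)] = kv_state

    def longest_prefix(self, tokens):
        # Walk from the longest candidate prefix down to length 1.
        for end in range(len(tokens), 0, -1):
            key = tuple(tokens[:end])
            if key in self.cache:
                return end, self.cache[key]
        return 0, None

cache = PrefixCache()
cache.put([1, 2, 3], "kv(1,2,3)")
hit_len, state = cache.longest_prefix([1, 2, 3, 4, 5])
# hit_len == 3: only tokens 4 and 5 still need prefilling.
```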
Long-context methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.
(a) Long context is shared in real-world scenarios. (b) Prefix caching is widely used in LLM frameworks. (c) Prefix caching is widely used in LLM APIs. (Prompt Caching / Context Caching)

SCBench covers:
- two typical shared-context modes;
- four categories of long-context capability;
- 12 subtasks;
- 13 long-context methods;
- a last-window query region + A-shape;
- Sub-O(n) memory is almost infeasible in multi-turn decoding;
- long-generation scenarios exhibit distribution shift issues.

04 KV-Cache-Centric Efficient Long-Context Methods

(a) Attention is sparse (in prefilling). (b) Sparsity of attention is dynamic. (c) Sparsity is dynamic in decoding.
Figure: Dynamic sparsity in attention. (a) Top-k (k=4096) columns cover a significant portion of attention scores in a 128K context. (b) Fewer scores are retrieved when reusing top-k indices across examples, highlighting its dynamic nature. Visualizations use LLaMA-3-8B on a single A100. (c) Dynamic sparsity in LLaMA-3-8B decoding with KV retrieval (100K tokens): dynamically selecting the top-1000 tokens achieves 89% recovery, while static selection drops to 71%.

Figure labels: Important Tokens; RoPE / N-gram; Sink Tokens.
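The recovery metric in panel (c) can be illustrated with a small NumPy sketch: compare the attention mass captured by a per-step (dynamic) top-k selection against reusing a fixed (static) top-k index set from an earlier step. This uses toy random data, not the paper's measurements:

```python
import numpy as np

rng = np.random.default_rng(0)
n_keys, k = 1000, 100

def attn_weights(rng, n):
    logits = rng.normal(size=n) * 4.0  # peaked logits -> sparse attention
    w = np.exp(logits - logits.max())
    return w / w.sum()

w1 = attn_weights(rng, n_keys)          # attention at decoding step 1
w2 = attn_weights(rng, n_keys)          # attention at a later step

dynamic_idx = np.argsort(w2)[-k:]       # top-k chosen for the later step itself
static_idx = np.argsort(w1)[-k:]        # top-k reused from step 1

dynamic_recall = w2[dynamic_idx].sum()  # attention mass recovered
static_recall = w2[static_idx].sum()
# dynamic_recall > static_recall: reused indices miss newly important tokens.
```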
- After pretraining, attention exhibits various sparse patterns, including A-shape, Vertical-Slash, and block-sparse patterns.
- These sparse patterns are fixed for each head across different inputs.
- The specific sparse elements (e.g., column index, slash index) dynamically change depending on the context.

Figure labels: Local Windows; Retrieval Tokens.
MInference utilizes the inherent dynamic sparsity found in LLMs alongside optimized GPU kernel design to enhance TTFT.

Pipeline:
- Step 1 (offline): kernel-aware sparse pattern search.
- Step 2 (online): online estimation of sparsity indices, then dynamic sparse attention calculation with PIT and FlashAttention.

FlashAttn vs. MInference:
- GPUs needed for sub-20s latency at 1M tokens: 60+ A100 vs. 8 A100 (8x fewer).
- Latency at 1M tokens on a single A100: 30 mins vs. 3 mins (10x faster).
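For a Vertical-Slash head, the online step estimates which columns (vertical lines) and diagonals (slashes) matter using only the last window of queries. A simplified NumPy sketch of that estimation — my own simplification for illustration, not the released MInference kernel code:

```python
import numpy as np

def estimate_vertical_slash(q, k, last_w=16, top_cols=8, top_slashes=8):
    """Estimate important columns (vertical lines) and diagonals (slashes)
    from the last window of queries. q, k: (seq_len, head_dim)."""
    n = q.shape[0]
    scores = q[-last_w:] @ k.T                        # (last_w, n) attention logits
    qpos = np.arange(n - last_w, n)[:, None]
    kpos = np.arange(n)[None, :]
    scores = np.where(kpos <= qpos, scores, -np.inf)  # causal mask
    probs = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)

    cols = np.argsort(probs.sum(axis=0))[-top_cols:]  # vertical-line importance

    causal = kpos <= qpos
    offs = (qpos - kpos)[causal]                      # diagonal offset per unmasked entry
    slash_mass = np.bincount(offs, weights=probs[causal], minlength=n)
    slashes = np.argsort(slash_mass)[-top_slashes:]   # slash (diagonal) importance
    return cols, slashes

rng = np.random.default_rng(1)
q, k = rng.normal(size=(128, 32)), rng.normal(size=(128, 32))
cols, slashes = estimate_vertical_slash(q, k)
```

The selected column and slash indices would then drive a sparse attention kernel over the full query range.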
Evaluation:
1. NIAH
2. RULER (avg tokens: 4K-128K)
3. Latency benchmark
MMInference observations:
- Local tokens in the temporal and spatial dimensions are evenly distributed within the attention map.
- Stride and starting position vary with context; the horizontal and vertical lines are evenly spaced and often symmetrical.
- 1) Intra-modality consistency; 2) modality-separated continuity.

MMInference: Grid head in multi-modality; Q-Boundary pattern; 2D-Boundary pattern. Benchmark.
- The VS pattern shifts to a Grid pattern when the input transitions from text to visual.

Related work: SeerAttention (learned gate), FlexPrefill (top-p), SampleAttention (top-p + column), STA, SpargeAttn, SparseVideoGen, AdaSpa, DiTFastAttn.
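The permutation idea behind the Grid head can be shown in a few lines: tokens attended on a stride-s grid are scattered in the original order, but reordering positions by residue class (i mod s) gathers them into a contiguous block that a dense kernel can process. A toy illustration with an assumed stride, not MMInference's actual kernel:

```python
import numpy as np

s, n = 4, 16                       # assumed grid stride and sequence length
mask = np.zeros((n, n), dtype=bool)
mask[::s, ::s] = True              # grid pattern: every s-th query/key position attended

# Reorder positions by residue class: same-phase positions become contiguous.
perm = np.array(sorted(range(n), key=lambda i: (i % s, i // s)))
permuted = mask[np.ix_(perm, perm)]

# The scattered grid is now a single dense block in the top-left corner.
assert permuted[: n // s, : n // s].all()
assert permuted.sum() == (n // s) ** 2
```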
RetrievalAttention:
- ANNS: Approximate Nearest Neighbor Search.
- Specifically, using inner product as the distance metric, also known as Maximum Inner Product Search (MIPS).
- Evaluation with RULER and ∞-Bench.

Observations:
- Queries and keys have different distributions in attention. Off-the-shelf ANNS indexes perform poorly on Q→K searches while working well for K→K searches.
- About 30-50% of the key vectors must be scanned to maintain acceptable performance.

(a) ANNS index performance. (b) Different distributions.

GPU requirement reduction by GPU-CPU co-execution.
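The MIPS formulation above is simple to state: retrieving the keys most relevant to a query under attention is exactly a maximum inner product search over the cached key vectors. A brute-force reference on random data — real systems replace this exhaustive scan with an approximate index:

```python
import numpy as np

def mips_topk(query, keys, k):
    """Exact Maximum Inner Product Search: top-k keys by dot product with the query."""
    scores = keys @ query              # the same score attention exponentiates
    return np.argsort(scores)[-k:][::-1]

rng = np.random.default_rng(0)
keys = rng.normal(size=(10_000, 64))   # cached key vectors
query = rng.normal(size=64)
top = mips_topk(query, keys, k=10)     # indices of the 10 most-attended keys
```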
- Lower-end GPU + CPU ≈ high-end GPU.
- Reduce memory access and data transfer with an OOD-aware ANNS index.
- Enables 128K inference on an RTX 4090 at 5 tokens/second.

Diagram: the GPU keeps hot tokens and computes partial attention over them; most KV is offloaded to the CPU side, where dynamically activated tokens are indexed by the ANNS index and the nearest KV tokens are dynamically retrieved; the two partial attentions are combined into the final attention output.
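Combining the GPU-side and CPU-side partial attentions uses the same trick FlashAttention uses to merge blocks: each side returns its output together with its softmax log-normalizer, and the two are merged exactly. A small sketch of that merge step, my own illustration rather than the system's code:

```python
import numpy as np

def partial_attention(q, k, v):
    """Attention over a subset of KV; returns output and log-sum-exp of scores."""
    scores = k @ q                       # (n,) attention logits for one query
    m = scores.max()
    w = np.exp(scores - m)
    return (w @ v) / w.sum(), m + np.log(w.sum())

def merge(o1, lse1, o2, lse2):
    """Exactly combine two partial softmax-attention outputs via their log-normalizers."""
    m = max(lse1, lse2)
    w1, w2 = np.exp(lse1 - m), np.exp(lse2 - m)
    return (w1 * o1 + w2 * o2) / (w1 + w2)

rng = np.random.default_rng(0)
q = rng.normal(size=16)
k, v = rng.normal(size=(100, 16)), rng.normal(size=(100, 16))
o_gpu, l_gpu = partial_attention(q, k[:30], v[:30])   # "hot" tokens on the GPU
o_cpu, l_cpu = partial_attention(q, k[30:], v[30:])   # retrieved tokens on the CPU
merged = merge(o_gpu, l_gpu, o_cpu, l_cpu)

full, _ = partial_attention(q, k, v)
assert np.allclose(merged, full)        # the merge reproduces full attention exactly
```

Because the merge is exact, splitting the KV cache across devices costs no accuracy; the approximation enters only through which tokens the ANNS index retrieves.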
- A single NVIDIA RTX 4090 (24 GB) can handle 128K tokens for an 8B-parameter LLM, generating a token in 0.188 seconds (24 GB RTX 4090 vs. 40 GB A100).

05 Summary and Outlook

Long-context
methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.

We propose SCBench, a KV-cache-centric benchmark for analyzing long-context methods, covering KV cache generation, compression, retrieval, and loading. It includes four capability tasks and two shared-context modes, from which we derive the following insights:
- Sub-O(n) memory is almost infeasible in multi-turn decoding;
- Task performance shows varying decline trends;
- All long-context methods experience performance degradation as the compression rate decreases;
- Long-generation scenarios exhibit distribution shift issues.

SCBench: A KV Cache-Centric Analysis of Long-Context Methods. (b) Two shared-context modes. (c) Overview of SCBench. [Project Page] [Code]

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention

Long-context
enables powerful vision and multi-modal applications, but the prefill cost remains a major bottleneck.

We propose MMInference, a modality-aware permutation-based dynamic sparse attention method for multi-modal inputs. It introduces Grid-Shape and Q-/2D-Boundary sparse attention, achieving up to 8.3x speedup without sacrificing performance. MMInference addresses two key challenges:
- Vision-specific inductive bias → handled via Grid-Shape attention;
- Modality boundary in mixed inputs → addressed by Q-/2D-Boundary attention.

(a) Grid pattern. (b) Permuted Grid pattern. (c) Q-Boundary pattern. (d) 2D-Boundary pattern.

We
Gridpattern.We
build
KVCacheas
aVectorStorageSystem
ina
CPU-GPUco-execution
setup(Fig1)
toacc-elerate
long-context
LLMinferencewithout
modelaccuracy
loss
(Fig3).The
coreof
RetroInfer
is
as
follows:
Fig1.
Architecture
of
RetroInfer
Fig2.
Attention-ware
Design
of
Wave
Index
An
Attention-aWare
VEctor
index
named
wave
Index
(Fig2),
which
adopts
tripartite
attention
approximation
and
accuracy-bound
attention
estimation
to
fit
the
dynamic
sparsity
of
attention
A
wave
buffer,
which
coordinates
KV
cache
placement
and
overlaps
computation
and
data
transfer
across
GPU
and
CPU
to
sustain
high
throughput.Powered
bythewave
indexandwave
buffer,thedecodingthroughput
of
RetroInfer
outperforms
baselines
by
4.5x-10.5x
across
different
context
lengths
while
it
is
the
only
solution
that
matches
full
attention
accuracy.RetroInfer:AVector-StorageApproachforScalableLong-Context
LLM
In
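Cluster-level attention estimation of the kind the wave index relies on can be illustrated simply: group keys into clusters, score each cluster by its centroid's inner product with the query, and fetch keys only from the top-scoring clusters. A toy sketch of this general idea — my own simplification, not RetroInfer's tripartite approximation or its implementation:

```python
import numpy as np

def cluster_topk_keys(q, keys, n_clusters=16, top_clusters=4, seed=0):
    """Estimate important keys at cluster granularity: score centroids, fetch top clusters."""
    rng = np.random.default_rng(seed)
    # Tiny k-means; a few iterations suffice for an illustration.
    centroids = keys[rng.choice(len(keys), n_clusters, replace=False)]
    for _ in range(5):
        dists = ((keys[:, None] - centroids[None]) ** 2).sum(-1)   # (n_keys, n_clusters)
        assign = np.argmin(dists, axis=1)
        for c in range(n_clusters):
            if (assign == c).any():
                centroids[c] = keys[assign == c].mean(axis=0)
    # Rank clusters by centroid-query inner product (a proxy for attention mass).
    best = np.argsort(centroids @ q)[-top_clusters:]
    return np.flatnonzero(np.isin(assign, best))   # key indices worth fetching

rng = np.random.default_rng(1)
keys = rng.normal(size=(2000, 32))
q = rng.normal(size=32)
idx = cluster_topk_keys(q, keys)   # scan only these keys instead of all 2000
```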