2025 AICon Global AI Development and Application Conference (Shanghai): Optimization and Practice of KV-Cache-Centric Efficient Long-Context Methods

Speaker:
姜慧強 (Huiqiang Jiang)
Research SDE, Microsoft Research Asia (Shanghai)
System-Algorithm Co-design; efficient methods to accelerate inference/training

Agenda
01 Applications and inference challenges of long-context LLMs
02 Current mainstream inference optimization methods and techniques
03 A KV-cache-centric LLM inference architecture
04 KV-cache-centric efficient long-context methods
05 Summary and outlook

01 Applications and inference challenges of long-context LLMs

Long-context application scenarios:
- Extended Meeting Time
- Massive Pages of Docs
- Lifelong Personalization
- Endless Agentic History
- Lengthy Codebases
- Complex Reasoning

10M tokens ≈ the PyTorch repository code ≈ the Lord of the Rings trilogy ≈ 500 reasoning iterations*

Almost all of the latest models can process contexts exceeding 100K tokens (see https://lifearchitect.ai/models/#context-windows).
Reference: DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning.

Inference challenges:
- Long Prefilling Latency: about 30 minutes to process 1M tokens on an A100 for an 8B LLM.
- Large GPU Memory Consumption: 62 GB of GPU memory is required for 512K tokens in fp16.
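For intuition on the 62 GB figure, the back-of-the-envelope arithmetic below assumes a LLaMA-3-8B-style configuration (32 layers, 8 KV heads under GQA, head dim 128). These architecture numbers are illustrative assumptions; only the fp16 precision, the 512K-token length, and the ~62 GB total come from the slide.

```python
# KV cache size estimate for a long context, assuming a LLaMA-3-8B-like config
# (32 layers, 8 KV heads under GQA, head_dim 128, fp16). These architecture
# numbers are assumptions for illustration; the slide only states the 62 GB total.
num_layers   = 32
num_kv_heads = 8          # grouped-query attention: KV heads, not query heads
head_dim     = 128
bytes_fp16   = 2
seq_len      = 512_000    # "512K tokens"

# factor 2 for keys and values
bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_fp16
total_gib = bytes_per_token * seq_len / 1024**3

print(f"{bytes_per_token / 1024:.0f} KiB per token, {total_gib:.1f} GiB for {seq_len} tokens")
# -> 128 KiB per token, ~62.5 GiB for 512,000 tokens
```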

Challenge-to-solution mapping:
- Large GPU Memory Consumption => RetrievalAttention
- Long Prefilling Latency => MInference

Figure: KV-cache lifecycle: Prefill (compute Keys & Values) -> KV Cache Storage -> Decode (token generation), with Prefix Caching reused across requests.

Our work along this lifecycle:
- MInference 1.0 / MMInference: dynamic sparse prefilling
- RetrievalAttention: alignment between ANNS and attention
- SCBench: exploring the bound of KV caching
- LLMLingua series: prompt compression

02 Current mainstream inference optimization methods and techniques

Prompt Caching / Context Caching
(a) Prefix caching is widely used in LLM frameworks (RadixAttention, Automatic Prefix Caching). (b) Prefix caching is widely used in LLM APIs (Prompt Caching / Context Caching).
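To make the prefix-caching idea concrete, here is a minimal sketch of block-level prefix reuse keyed by prefix hashes. The block size, the dictionary-based store, and the `compute_kv` stub are illustrative assumptions, not how RadixAttention or Automatic Prefix Caching are actually implemented.

```python
# Minimal sketch of block-level prefix caching: KV for a block of tokens is
# keyed by the whole prefix up to and including that block, so two requests
# sharing a prefix reuse the same cached KV blocks. Real systems
# (RadixAttention, Automatic Prefix Caching) add eviction, paging, and
# radix-tree matching; this shows only the core idea.
from typing import Dict, List, Tuple

BLOCK = 16  # tokens per KV block (illustrative choice)

def compute_kv(tokens: List[int]) -> object:
    """Stand-in for running the model's prefill over `tokens`."""
    return f"KV({len(tokens)} tokens)"

class PrefixKVCache:
    def __init__(self) -> None:
        self.blocks: Dict[Tuple[int, ...], object] = {}

    def prefill(self, tokens: List[int]) -> Tuple[int, List[object]]:
        """Return (number of reused tokens, KV blocks for the full prompt)."""
        kv_blocks, reused = [], 0
        for start in range(0, len(tokens), BLOCK):
            prefix_key = tuple(tokens[: start + BLOCK])  # prefix up to block end
            if prefix_key in self.blocks:
                reused += len(prefix_key) - start
            else:
                self.blocks[prefix_key] = compute_kv(tokens[start : start + BLOCK])
            kv_blocks.append(self.blocks[prefix_key])
        return reused, kv_blocks

cache = PrefixKVCache()
system_prompt = list(range(64))
cache.prefill(system_prompt + [101, 102])        # cold request: everything computed
reused, _ = cache.prefill(system_prompt + [201, 202])
print(f"reused {reused} tokens of KV")           # -> reused 64 tokens of KV
```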

03 A KV-cache-centric LLM inference architecture

Shared-context scenarios: repo-level code debugging, long-document QA, multi-turn dialogue, and self-play reasoning; all are served on top of an LLM with prefix caching (RadixAttention / Automatic Prefix Caching).

Long-context methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.
(a) Long context is shared in real-world scenarios. (b) Prefix caching is widely used in LLM frameworks. (c) Prefix caching is widely used in LLM APIs (Prompt Caching / Context Caching).

SCBench design and findings:
- Two typical shared-context modes;
- Four categories of long-context capability;
- 12 subtasks;
- 13 long-context methods;
- last-window query region + A-shape;
- Sub-O(n) memory is almost infeasible in multi-turn decoding;
- Long-generation scenarios exhibit distribution shift issues.

04 KV-cache-centric efficient long-context methods

(a) Attention is sparse (in prefilling). (b) Sparsity of attention is dynamic. (c) Dynamic sparsity in decoding.

Figure. Dynamic sparsity in attention: (a) Top-k (k=4096) columns cover a significant portion of attention scores in a 128K context. (b) Fewer scores are retrieved when reusing top-k indices across examples, highlighting its dynamic nature. Visualizations use LLaMA-3-8B on a single A100.

Figure. (c) Dynamic sparsity in LLaMA-3-8B decoding in KV retrieval (100K tokens): dynamically selecting the top-1000 tokens achieves 89% recovery, while static selection drops to 71%.
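This kind of measurement can be reproduced with a short script: for a given decoding query, compare how much attention probability mass is covered by dynamically re-selected top-k keys versus top-k indices reused from another query. The tensors below are random placeholders rather than real LLaMA-3-8B states, so only the procedure, not the numbers, carries over.

```python
# Sketch of the dynamic vs. static top-k "recovery" measurement. Random tensors
# stand in for real query/key states; the slide's numbers are measured on
# LLaMA-3-8B with ~100K-token contexts.
import torch

torch.manual_seed(0)
n_ctx, head_dim, k = 8192, 128, 1024

keys = torch.randn(n_ctx, head_dim)
q_old = torch.randn(head_dim)   # earlier query whose top-k indices get reused
q_new = torch.randn(head_dim)   # current decoding query

def attn_probs(q: torch.Tensor) -> torch.Tensor:
    return torch.softmax(keys @ q / head_dim**0.5, dim=-1)

p_new = attn_probs(q_new)
dynamic_idx = p_new.topk(k).indices              # re-selected for the current query
static_idx = attn_probs(q_old).topk(k).indices   # reused from the earlier query

print(f"dynamic top-{k} recovery: {p_new[dynamic_idx].sum().item():.2%}")
print(f"static  top-{k} recovery: {p_new[static_idx].sum().item():.2%}")
# Absolute numbers from random data are not meaningful; on real model states the
# slide reports 89% (dynamic) vs. 71% (static) recovery with top-1000 tokens.
```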

Figure labels: Important Tokens, RoPE / N-gram, Sink Tokens, Local Windows, Retrieval Tokens.

- After pretraining, attention exhibits various sparse patterns, including A-shape, Vertical-Slash, and block-sparse patterns.
- These sparse patterns are fixed for each head across different inputs.
- The specific sparse elements (e.g., column index, slash index) dynamically change depending on the context.
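For reference, the A-shape and Vertical-Slash shapes can be written down directly as boolean masks, as sketched below. The sizes and the particular column/slash indices are arbitrary assumptions, and production kernels operate block-wise on the GPU rather than materializing dense masks.

```python
# Boolean-mask illustration of two sparse attention patterns named on the slide:
# A-shape = sink tokens (first columns) + local window near the diagonal;
# Vertical-Slash = a few full columns ("vertical") plus a few diagonals ("slash").
# Sizes and the chosen indices are arbitrary assumptions for illustration.
import torch

n = 1024
i = torch.arange(n).unsqueeze(1)   # query positions
j = torch.arange(n).unsqueeze(0)   # key positions
causal = j <= i

# A-shape: attend to the first `n_sink` keys and the last `window` keys.
n_sink, window = 4, 128
a_shape = causal & ((j < n_sink) | (i - j < window))

# Vertical-Slash: selected columns plus selected diagonal offsets.
vertical_cols = torch.tensor([0, 7, 311, 512])
slash_offsets = torch.tensor([0, 1, 2, 64])
vertical_slash = causal & (
    (j.unsqueeze(-1) == vertical_cols).any(-1)
    | ((i - j).unsqueeze(-1) == slash_offsets).any(-1)
)

print("A-shape density:        %.3f" % a_shape.float().mean().item())
print("Vertical-Slash density: %.3f" % vertical_slash.float().mean().item())
```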

MInference utilizes the inherent dynamic sparsity found in LLMs alongside optimized GPU kernel design to enhance the TTFT.

MInference pipeline:
- 01 Pattern Search (Step 1, offline): kernel-aware sparse pattern search;
- 02 Estimation (Step 2, online): online estimation of sparsity indices;
- 03 Acceleration (Step 2, online): dynamic sparse attention calculation with PIT and FlashAttention.
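The online estimation step can be made concrete with a small sketch: for a Vertical-Slash head, only the last window of queries is attended against all keys, column and diagonal scores are aggregated, and the top vertical/slash indices are kept for the sparse computation. The window size, budgets, and random tensors below are illustrative assumptions; the real implementation builds block-sparse layouts and custom kernels on top of these indices.

```python
# Sketch of "online estimation of sparsity indices" for a Vertical-Slash head,
# in the spirit of MInference: score columns (vertical lines) and diagonals
# (slash lines) using only the last window of queries, then keep the top ones.
# Budgets, window size, and random inputs are illustrative assumptions.
import torch

torch.manual_seed(0)
n_ctx, head_dim = 4096, 128
last_w, n_vertical, n_slash = 64, 256, 64

q = torch.randn(n_ctx, head_dim)
k = torch.randn(n_ctx, head_dim)

# Attention of the last `last_w` queries against all keys
# (causality inside the window is ignored here for brevity).
probs = torch.softmax(q[-last_w:] @ k.T / head_dim**0.5, dim=-1)  # [last_w, n_ctx]

# Vertical lines: columns that receive high attention from the last window.
vertical_idx = probs.sum(dim=0).topk(n_vertical).indices

# Slash lines: diagonals (constant query_pos - key_pos offset) with high mass.
q_pos = torch.arange(n_ctx - last_w, n_ctx).unsqueeze(1)   # [last_w, 1]
k_pos = torch.arange(n_ctx).unsqueeze(0)                   # [1, n_ctx]
offsets = (q_pos - k_pos).clamp(min=0)                     # diagonal id per entry
slash_mass = torch.zeros(n_ctx).index_add_(0, offsets.flatten(), probs.flatten())
slash_idx = slash_mass.topk(n_slash).indices

print("top vertical columns:", vertical_idx[:8].tolist())
print("top slash offsets   :", slash_idx[:8].tolist())
# A full implementation would materialize these indices as a sparse layout and
# run dynamic block-sparse attention over the whole query set.
```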

Results, FlashAttn vs. MInference:
- GPUs needed for sub-20s latency at 1M tokens: 60+ A100s vs. 8 A100s;
- Latency at 1M tokens on a single A100: 30 minutes vs. 3 minutes (10x).

Evaluation: 1. NIAH; 2. RULER (avg. tokens: 4K-128K); 3. latency benchmark.

MMInference (dynamic sparse attention for multi-modal inputs):
- Local tokens in the temporal and spatial dimensions are evenly distributed within the attention map.
- Stride and starting position vary with context; the horizontal and vertical lines are evenly spaced and often symmetrical.
- 1) Intra-modality consistency; 2) modality-separated continuity.
MMInference introduces the Grid head in multi-modality, the Q-Boundary pattern, and the 2D-Boundary pattern.
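A toy example of the permutation idea behind grid heads: if a head mostly attends between tokens at the same spatial position across frames (positions congruent modulo the frame stride), reordering tokens by that phase turns the scattered grid into contiguous, block-diagonal regions that block-sparse kernels can process. The synthetic mask and sizes below are assumptions for illustration only; the actual MMInference pattern definitions and kernels are more involved.

```python
# Toy demonstration: a "grid" sparsity pattern (entries where i ≡ j mod stride,
# e.g. the same spatial location across video frames) becomes block-diagonal
# after permuting tokens by their phase within a frame. Frame count, stride,
# and the mask definition are synthetic assumptions.
import numpy as np

stride, frames = 8, 16            # tokens per frame, number of frames
n = stride * frames
idx = np.arange(n)

grid_mask = (idx[:, None] % stride) == (idx[None, :] % stride)

# Group tokens by phase (stable sort keeps temporal order inside each group).
perm = np.argsort(idx % stride, kind="stable")
permuted = grid_mask[np.ix_(perm, perm)]

# After permutation the mask is exactly block-diagonal with `stride` blocks of
# size `frames`, so a block-sparse kernel only touches n * frames entries.
expected = np.kron(np.eye(stride, dtype=bool), np.ones((frames, frames), dtype=bool))
assert (permuted == expected).all()
print("grid pattern is block-diagonal after modality-aware permutation")
```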

The Vertical-Slash (VS) pattern shifts to a Grid pattern when the input transitions from text to visual.

Related sparse attention work: SeerAttention (learned gated), FlexPrefill (top-p), SampleAttention (top-p + column), STA, SpargeAttn, SparseVideoGen, AdaSpa, DiTFastAttn.

RetrievalAttention:
- ANNS: Approximate Nearest Neighbor Search. Specifically, using inner product as the distance metric, also known as Maximum Inner Product Search (MIPS).
- Evaluation with RULER and ∞-Bench.
- Queries and keys have different distributions in attention. Off-the-shelf ANNS indexes perform poorly on Q→K searches while working well for K→K searches.
- About 30~50% of key vectors must be scanned to maintain acceptable performance.

Figure. (a) ANNS index performance. (b) Different distributions of queries and keys.

GPU requirement reduction by GPU-CPU co-execution: a lower-end GPU + CPU can match a high-end GPU.
- Reduce memory access and data transfer with an OOD-aware ANNS index.
- Enable 128K inference on an RTX 4090 at 5 tokens/second.

Architecture (figure): most KV is offloaded to the CPU side; for each query vector, a retrieval search over the ANNS index finds the dynamically activated nearest KV tokens, while hot tokens stay on the GPU; the partial attention results from the two sides are combined into the final attention output.
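The sketch below illustrates the "combine partial attention" step in this kind of design: exact attention over a small GPU-resident window and attention over CPU-side tokens returned by a nearest-neighbor search are merged through their shared softmax normalizer. The brute-force `retrieve` function is a stand-in for the OOD-aware ANNS index, and all sizes and tensors are illustrative assumptions.

```python
# Sketch of GPU-CPU co-execution with partial attention combination: a small
# "hot" window of KV stays on the GPU, the rest is offloaded; for each decoding
# query, nearest keys are retrieved (brute-force MIPS here, standing in for the
# OOD-aware ANNS index) and the two partial attention results are merged using
# their softmax normalizers. Sizes and random data are illustrative.
import torch

torch.manual_seed(0)
n_ctx, head_dim, n_hot, top_k = 16384, 128, 512, 256

k_all = torch.randn(n_ctx, head_dim)
v_all = torch.randn(n_ctx, head_dim)
q = torch.randn(head_dim)

hot_idx = torch.arange(n_ctx - n_hot, n_ctx)    # recent tokens kept on the GPU
cold_idx = torch.arange(0, n_ctx - n_hot)       # offloaded to the CPU side

def partial_attention(idx: torch.Tensor):
    """Return (unnormalized numerator, softmax normalizer) over a token subset."""
    scores = k_all[idx] @ q / head_dim**0.5
    m = scores.max()
    w = torch.exp(scores - m)
    return torch.exp(m) * (w @ v_all[idx]), torch.exp(m) * w.sum()

def retrieve(idx: torch.Tensor, k: int) -> torch.Tensor:
    """Brute-force MIPS over offloaded keys; an ANNS index would replace this."""
    return idx[(k_all[idx] @ q).topk(k).indices]

num_hot, den_hot = partial_attention(hot_idx)
num_ret, den_ret = partial_attention(retrieve(cold_idx, top_k))
out = (num_hot + num_ret) / (den_hot + den_ret)  # combined attention output

# Compare against full attention over all 16K tokens.
full = torch.softmax(k_all @ q / head_dim**0.5, dim=-1) @ v_all
print("relative error vs. full attention:",
      (out - full).norm().item() / full.norm().item())
```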

One NVIDIA RTX 4090 (24 GB) can handle 128K tokens for an 8B-parameter LLM, generating a token in 0.188 seconds (24 GB RTX 4090 vs. 40 GB A100).

05 Summary and outlook

Long-context methods are designed and utilized around the KV cache, but existing benchmarks focus only on single-request scenarios, ignoring its full lifecycle in real-world use.

SCBench: A KV-Cache-Centric Analysis of Long-Context Methods. We propose SCBench, a KV-cache-centric benchmark for analyzing long-context methods, covering KV cache generation, compression, retrieval, and loading. It includes four capability tasks and two shared-context modes, from which we derive the following insights:
- Sub-O(n) memory is almost infeasible in multi-turn decoding;
- Task performance shows varying decline trends;
- All long-context methods experience performance degradation as the compression rate decreases;
- Long-generation scenarios exhibit distribution shift issues.

Figures: (b) Two shared-context modes; (c) Overview of SCBench. (Project page, code.)

MMInference: Accelerating Pre-filling for Long-Context VLMs via Modality-Aware Permutation Sparse Attention. Long context enables powerful vision and multi-modal applications, but prefill cost remains a major bottleneck. We propose MMInference, a modality-aware permutation-based dynamic sparse attention method for multi-modal inputs. It introduces Grid-Shape and Q-/2D-Boundary sparse attention, achieving up to 8.3× speedup without sacrificing performance. MMInference addresses two key challenges:
- Vision-specific inductive bias → handled via Grid-Shape attention;
- Modality boundary in mixed inputs → addressed by Q-/2D-Boundary attention.

Figures: (a) Grid pattern; (b) Permuted Grid pattern; (c) Q-Boundary pattern; (d) 2D-Boundary pattern.

RetroInfer: A Vector-Storage Approach for Scalable Long-Context LLM Inference. We build the KV cache as a vector storage system in a CPU-GPU co-execution setup (Fig. 1) to accelerate long-context LLM inference without model accuracy loss (Fig. 3). The core of RetroInfer is as follows:
- An Attention-aWare VEctor index named Wave Index (Fig. 2), which adopts tripartite attention approximation and accuracy-bound attention estimation to fit the dynamic sparsity of attention.
- A Wave Buffer, which coordinates KV cache placement and overlaps computation and data transfer across GPU and CPU to sustain high throughput.
Powered by the Wave Index and Wave Buffer, the decoding throughput of RetroInfer outperforms baselines by 4.5x-10.5x across different context lengths, while it is the only solution that matches full-attention accuracy.

Fig. 1: Architecture of RetroInfer. Fig. 2: Attention-aware design of the Wave Index.
