向量處理課件_第1頁(yè)
向量處理課件_第2頁(yè)
向量處理課件_第3頁(yè)
向量處理課件_第4頁(yè)
向量處理課件_第5頁(yè)
已閱讀5頁(yè),還剩47頁(yè)未讀, 繼續(xù)免費(fèi)閱讀

下載本文檔

版權(quán)說(shuō)明:本文檔由用戶(hù)提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請(qǐng)進(jìn)行舉報(bào)或認(rèn)領(lǐng)

文檔簡(jiǎn)介

第七章向量處理主要參考:ProfessorDavidA.Patterson的講稿傳統(tǒng)指令級(jí)并行技術(shù)的問(wèn)題挖掘ILP的傳統(tǒng)方法的主要缺陷: 1)提高流水線(xiàn)的時(shí)鐘頻率:提高時(shí)鐘頻率,有時(shí)導(dǎo)致CPI隨著增加(branches,otherhazards) 2)指令預(yù)取和譯碼:有時(shí)在每個(gè)時(shí)鐘周期很難預(yù)取和譯碼多條指令 3)提高Cache命中率:在有些計(jì)算量較大的應(yīng)用中(科學(xué)計(jì)算)需要大量的數(shù)據(jù),其局部性較差,有些程序處理的是連續(xù)的媒體流(multimedia),其局部性也較差。AlternativeModel:VectorProcessing+r1r2r3addr3,r1,r2SCALAR(1operation)v1v2v3+vectorlengthadd.vvv3,v1,v2VECTOR(Noperations)向量處理機(jī)具有更高層次的操作,一條向量指令可以處理N個(gè)或N對(duì)操作數(shù)(處理對(duì)象是向量)Spec92fp Operations(Millions) Instructions(M)ProgramRISCVectorR/VRISCVectorR/V

swim256 115 95 1.1x 115 0.8 142xhydro2d 58 40 1.4x 58 0.8 71xnasa7 69 41 1.7x 69 2.2 31xsu2cor 51 35 1.4x 51 1.8 29xtomcatv 15 10 1.4x 15 1.3 11xwave5 27 25 1.1x 27 7.2 4xmdljdp2 32 52 0.6x 32 15.8 2xOperation&InstructionCount:

RISCv.VectorProcessor

(fromF.Quintana,U.Barcelona.)

Vectorreducesopsby1.2X,instructionsby20X向量處理機(jī)的基本結(jié)構(gòu)memory-memoryvectorprocessors:所有的向量操作是存儲(chǔ)器到存儲(chǔ)器的vector-registerprocessors:除了load和store操作外,所有的操作是向量寄存器與向量寄存器間的操作

向量機(jī)的Load/Store結(jié)構(gòu) 1980年以后的所有的向量處理機(jī)都是這種結(jié)構(gòu):

Cray,Convex,Fujitsu,Hitachi,NEC

下面我們也主要針對(duì)這種結(jié)構(gòu)向量處理機(jī)的基本組成單元VectorRegister:固定長(zhǎng)度的一塊區(qū)域,存放單個(gè)向量

至少2個(gè)讀端口和一個(gè)寫(xiě)端口

典型的有8-32向量寄存器,每個(gè)寄存器存放64到128個(gè)64位的元素VectorFunctionalUnits(FUs):全流水化的,每一個(gè)clock啟動(dòng)一個(gè)新的操作

一般4到8個(gè)FUs:FPadd,FPmult,FPreciprocal(1/X),integeradd,logical,shift;可能有些重復(fù)設(shè)置的部件VectorLoad-StoreUnits(LSUs):全流水化地load或store一個(gè)向量,可能會(huì)配置多個(gè)LSU部件Scalarregisters:存放單個(gè)元素用于標(biāo)量處理或存儲(chǔ)地址用交叉開(kāi)關(guān)連接(Cross-bartoconnect)FUs,LSUs,registers32MemoryoperationsLoad/store操作成組地在寄存器和存儲(chǔ)器之間移動(dòng)數(shù)據(jù)三類(lèi)尋址方式Unitstride(單步長(zhǎng))FastestNon-unit

(constant)stride(常數(shù)步長(zhǎng))Indexed(gather-scatter)(間接尋址)等價(jià)于寄存器間接尋址方式對(duì)稀疏矩陣有效用于向量化操作的指令增多Example1DAXPY(Y=a

*

X+Y)

LD F0,a ADDI R4,Rx,#512 ;lastaddresstoloadloop:LD F2,0(Rx) ;loadX(i) MULTD F2,F0,F2 ;a*X(i) LD F4,0(Ry) ;loadY(i) ADDD F4,F2,F4 ;a*X(i)+Y(i) SD F4

,0(Ry) ;storeintoY(i) ADDI Rx,Rx,#8 ;incrementindextoX ADDI Ry,Ry,#8 ;incrementindextoY SUB R20,R4,Rx ;computebound BNZ R20,loop ;checkifdone

LD F0,a ;loadscalara LV V1,Rx ;loadvectorX MULTS V2,F0,V1 ;vector-scalarmult. LV V3,Ry ;loadvectorY ADDV V4,V2,V3 ;add SV Ry,V4 ;storetheresultAssumingvectorsX,Yarelength64Scalarvs.Vector578(2+9*64)vs.

321(1+5*64)ops(1.8X)578(2+9*64)vs.

6instructions(96X)64operationvectors+noloopoverheadalso64XfewerpipelinehazardsVectorLinpackPerformance(MFLOPS)

Machine Year Clock 100x100 1kx1k Peak(Procs)Cray1 1976 80MHz 12 110 160(1)CrayXMP 1983 120MHz 121 218 940(4)CrayYMP 1988 166MHz 150 307 2,667(8)CrayC-90 1991 240MHz 387 902 15,238(16)CrayT-90 1996 455MHz 705 1603 57,600(32)Conv.C-1 1984 10MHz 3 -- 20(1)Conv.C-4 1994 135MHz 160 2531 3240(4)Fuj.VP200 1982 133MHz 18 422 533(1)NECSX/2 1984 166MHz 43 885 1300(1)NECSX/3 1995 400MHz 368 2757 25,600(4)MatrixInverse(gaussianelimination)review向量處理機(jī)基本特征把兩個(gè)向量的對(duì)應(yīng)分量進(jìn)行運(yùn)算,產(chǎn)生一個(gè)結(jié)果向量向量處理機(jī)基本結(jié)構(gòu)memory-memoryvectorprocessorsvector-registerprocessors向量處理機(jī)基本組成VectorRegisterVectorFunctionalUnits(FUs)VectorLoad-StoreUnits(LSUs)

Scalarregisters用交叉開(kāi)關(guān)連接(CrossBar)VectorExecutionTimeTime=f(vectorlength,datadependencies,struct.hazards)Initiationrate:功能部件消耗向量元素的速率

(=numberoflanes;usually1or2onCrayT-90)Convoy:可在同一時(shí)鐘周期開(kāi)始執(zhí)行的指令集合(nostructuralordatahazards)Chime:執(zhí)行一個(gè)convoy所花費(fèi)的大致時(shí)間(approx.time)mconvoystakemchimes;如果每個(gè)向量長(zhǎng)度為n,那么m個(gè)convoys所花費(fèi)的時(shí)間是m個(gè)chimes,每個(gè)chime所花費(fèi)的時(shí)間是n個(gè)clocks,該程序所花費(fèi)的總時(shí)間大約為mxnclockcycles(ignoresoverhead;goodapproximizationforlongvectors)4convoys,1lane,VL=64=>4x64=256clocks(or4clocksperresult)1: LV V1,Rx ;loadvectorX2: MULV V2,F0,V1 ;vector-scalarmult. LV V3,Ry ;loadvectorY3: ADDV V4,V2,V3 ;add4: SV Ry,V4 ;storetheresultVectorLoad/StoreUnits&Memories通常Load/Store部件的Start-up時(shí)間較長(zhǎng)存儲(chǔ)系統(tǒng)必須能有效地支持向量處理(即提供足夠的帶寬,有足夠高的供數(shù)率)(#lanesxword)/clockcycle許多向量處理器使用多存儲(chǔ)體技術(shù)(獨(dú)立存儲(chǔ)體vs.簡(jiǎn)單的多體交叉技術(shù)) 1)支持每個(gè)cycle可以有多個(gè)loads/stores操作

2)支持非順序訪(fǎng)問(wèn)注意:memorybanks的數(shù)量>memorylatency以避免stallsmbanks=>mwordspermemorylatencylclocksifm<l(memorylatency),thengapinmemorypipeline:clock: 0 … l

l+1 l+2 … l+m-1 l+m … 2lword: -- … 0 1 2 … m-1 -- … mVectorLength當(dāng)向量的長(zhǎng)度不是64時(shí)(假設(shè)向量寄存器的長(zhǎng)度是64)怎么辦?vector-lengthregister(VLR)控制特定向量操作的長(zhǎng)度,包括向量的load/store.(當(dāng)然一次操作的向量的長(zhǎng)度不能>向量寄存器的長(zhǎng)度)例如: do10i=1,n10 Y(i)=a*X(i)+Y(i)n的值只有在運(yùn)行時(shí)才能知道

n>Max.VectorLength(MVL)怎么辦?StripMining(分段開(kāi)采)假設(shè)VectorLength>Max.VectorLength(MVL)?Stripmining:產(chǎn)生新的代碼,使得每個(gè)向量操作的元素?cái)?shù)

MVL第一次循環(huán)做最小片(nmodMVL),以后按VL=MVL操作 low=1

VL=(nmodMVL)/*findtheoddsizepiece*/

do1j=0,(n/MVL)/*outerloop*/ do10i=low,low+VL-1/*runsforlengthVL*/

Y(i)=a*X(i)+Y(i)/*mainoperation*/

10 continue

low=low+VL/*startofnextvector*/

VL=MVL/*resetthelengthtomax*/

1 continueADDIR2,R0,#1600;total#bytesinvectorADDR2,R2,Ra;addressoftheendofAvector

ADDIR1,R0,#8;loadslengthof1stsegmentMOVI2SVLR,R1;loadvectorlengthinVLRADDIR1,R0,#64;lengthinbytesof1stsegmentADDIR3,R0,#64;vectorlengthofothersegmentsLoop:LVV1,Rb;loadBMULSVV2,V1,Fs;vector*scalarSVRa,V2;storeAADDRa,Ra,R1;addressofnextsegmentofAADDRb,Rb,R1;addressofnextsegmentofB

ADDIR1,R0,#512;loadbyteoffsetnextsegmentMOVI2SVLR,R3;setlengthto64elementsSUBR4,R2,Ra;attheendofA?BNEZR4,Loop;ifnot,gobackTstart=12+7+12=31T200=660+4*31=784每一元素的執(zhí)行時(shí)間=784/200=3.9VectorStride假設(shè)處理順序相鄰的元素在存儲(chǔ)器中不是順序存儲(chǔ),例如

do10i=1,100 do10j=1,100 A(i,j)=0.0 do10k=1,10010 A(i,j)=A(i,j)+B(i,k)*C(k,j)B或C的兩次訪(fǎng)問(wèn)不會(huì)相鄰(相隔800bytes)stride:向量中相鄰元素間的距離

=>LVWS(loadvectorwithstride)instructionStrides=>會(huì)導(dǎo)致體沖突

(e.g.,stride=32and16banks)VectorOpt#1:Chaining例如: MULV V1,V2,V3 ADDV V4,V1,V5 ;separateconvoy?chaining:向量寄存器(V1)不是單一實(shí)體,而是由一組寄存器組成,那么可以采用流水線(xiàn)技術(shù),一旦完成向量寄存器中的一個(gè)元素的計(jì)算,就流向下一階段Flexiblechaining:允許某一向量鏈接到任何其他活動(dòng)的向量操作上=>需要有多個(gè)讀寫(xiě)操作端口,允許多條向量指令訪(fǎng)問(wèn)同一個(gè)向量寄存器

只要有足夠的硬件,通過(guò)鏈接增加了Convoy中包含的指令條數(shù),從而可能會(huì)減少Chime數(shù)766464764664UnchainedChainedmultvaddvmultvaddvVectorUnitStructureLaneFunctionalUnitVectorRegistersMemorySubsystemElements0,4,8,…Elements1,5,9,…Elements2,6,10,…Elements3,7,11,…T0VectorMicroprocessor(UCB/ICSI,1995)LaneVectorregisterelementsstripedoverlanes[0][8][16][24][1][9][17][25][2][10][18][26][3][11][19][27][4][12][20][28][5][13][21][29][6][14][22][30][7][15][23][31]loadVectorInstructionParallelismCanoverlapexecutionofmultiplevectorinstructionsexamplemachinehas32elementspervectorregisterand8lanesloadmulmuladdaddLoadUnitMultiplyUnitAddUnittimeInstructionissueComplete24operations/cyclewhileissuing1shortinstruction/cycleVectorOpt#2:ConditionalExecutionSuppose: do100i=1,64 if(A(i).ne.0)then A(i)=A(i)–B(i) endif 100continuevector-maskcontrol

使用長(zhǎng)度為MVL的布爾向量控制向量指令的執(zhí)行

當(dāng)vector-maskregister

使能時(shí),向量指令操作僅對(duì)vector-maskregister中對(duì)應(yīng)位為1的分量起作用LVV1,Ra;loadvectorAintoV1LVV2,Rb;loadvectorBL.DF0,#0;loadFPzerointoF0SNEVS.DV1,F0;setsVM(i)to1ifV1(i)!=F0SUBV.DV1,V1,V2;subtractundervectormaskCVM;setthevectormasktoall1sSVRa,V1;storetheresultinA使用vector-mask寄存器的缺陷條件不滿(mǎn)足時(shí)向量指令仍然需要花費(fèi)時(shí)間有些向量處理器帶條件的向量執(zhí)行僅控制向目標(biāo)寄存器的寫(xiě)操作,可能會(huì)有除法錯(cuò)。VectorOpt#3:SparseMatricesSuppose: do 100i=1,n100 A(K(i))=A(K(i))+C(M(i))gather(LVI)operation使用indexvector

中給出的偏移再加基址來(lái)讀取=>anonsparsevectorinavectorregister這些元素以密集的方式操作完成后,再使用同樣的indexvector存儲(chǔ)到稀疏矩陣的對(duì)應(yīng)位置這些操作編譯時(shí)可能無(wú)法完成。主要原因:編譯器無(wú)法預(yù)知Ki以及是否有數(shù)據(jù)相關(guān)使用CVI

設(shè)置步長(zhǎng)(index0,1xm,2xm,...,63xm)SparseMatrixExampleCache(1993)vs.Vector(1988) IBMRS6000 CrayYMPClock 72MHz 167MHzCache 256KB 0.25KBLinpack 140MFLOPS 160(1.1)SparseMatrix 17MFLOPS 125(7.3)

(CholeskyBlocked)Cache:1addresspercacheblock(32Bto64B)Vector:1addressperelement(4B)Challenges:

VectorExamplewithdependency/*Multiplya[m][k]*b[k][n]togetc[m][n]*/for(i=1;i<m;i++){for(j=1;j<n;j++){sum=0;for(t=1;t<k;t++){

sum+=a[i][t]*b[t][j];}c[i][j]=sum;}}Problem:creatingsumofelementsinavector=slowandrequiresuseofscalarunitOptimizedVectorExample/*Multiplya[m][k]*b[k][n]togetc[m][n]*/for(i=1;i<m;i++){for(j=1;j<n;j+=32)/*Stepj32atatime.*/{sum[0:31]=0;/*Initializeavectorregistertozeros.*/for(t=1;t<k;t++){

a_scalar=a[i][t];/*Getscalarfromamatrix.*/

b_vector[0:31]=b[t][j:j+31];/*Getvectorfrombmatrix.*/

prod[0:31]=b_vector[0:31]*a_scalar;/*Doavector-scalar multiply.*/sum[0:31]+=prod[0:31];/*Vector-vectoraddintoresults.*/}/*Unit-stridestoreofvectorofresults.*/c[i][j:j+31]=sum[0:31];}}Considervectorprocessorasacollectionof32virtualprocessors!Doesnotneedreduce!ApplicationsLimitedtoscientificcomputing?MultimediaProcessing(compress.,graphics,audiosynth,imageproc.)Standardbenchmarkkernels(MatrixMultiply,FFT,Convolution,Sort)LossyCompression(JPEG,MPEGvideoandaudio)LosslessCompression(Zeroremoval,RLE,Differencing,LZW)Cryptography(RSA,DES/IDEA,SHA/MD5)SpeechandhandwritingrecognitionOperatingsystems/Networking(memcpy,memset,parity,checksum)Databases(hash/join,datamining,image/videoserving)Languagerun-timesupport(stdlib,garbagecollection)evenSPECint95VectorforMultimedia?IntelMMX:57new80x86instructions(1stsince386)similartoIntel860,Mot.88110,HPPA-71000LC,UltraSPARC3datatypes:88-bit,416-bit,232-bitin64bitsreuse8FPregisters(FPandMMXcannotmix)-shortvector:load,add,store88-bitoperandsClaim:overallspeedup1.5to2Xfor2D/3Dgraphics,audio,video,speech,comm.,...useindriversoraddedtolibraryroutines;nocompiler+MMXInstructionsMove32b,64bAdd,Subtractinparallel:88b,416b,232bopt.signed/unsignedsaturate(settomax)ifoverflowShifts(sll,srl,sra),And,AndNot,Or,Xor

inparallel:88b,416b,232bMultiply,Multiply-Addinparallel:416bCompare=,>inparallel:88b,416b,232bsetsfieldto0s(false)or1s(true);removesbranchesPack/UnpackConvert32b<–>16b,16b<–>8bPacksaturates(settomax)ifnumberistoolarge

VectorsandVariableDataWidthProgrammerthinksintermsofvectorsofdataofsomewidth(8,16,32,or64bits)Goodformultimedia;Moreelegantthan

MMX-styleextensionsDon’thavetoworryabouthowdatastoredinhardwareNoneedforexplicitpack/unpackoperationsJustthinkofmorevirtualprocessorsoperatingonnarrowdataExpandMaximumVectorLengthwithdecreasingdatawidth:

64x64bit,128x32bit,256x16bit,512x8bitMediaprocesing:

Vectorizable?VectorLengths?

Kernel

VectorlengthMatrixtranspose/multiply #verticesatonceDCT(video,communication) imagewidthFFT(audio) 256-1024Motionestimation(video) imagewidth,iw/16Gammacorrection(video) imagewidthHaartransform(mediamining) imagewidthMedianfilter(imageprocessing) imagewidthSeparableconvolution(c.) imagewidth(fromPradeepDubey-IBM,)VectorPitfallsPitfall:Concentratingonpeakperformanceandignoringstart-upoverhead:e.g.NV(lengthfasterthanscalar)>100(CDC-star)Pitfall:Increasingvectorperformance,withoutcomparableincreasesinscalarperformance

(Amdahl'sLaw)failureofCraycompetitorfromhisformercompanyPitfall:GoodprocessorvectorperformancewithoutprovidinggoodmemorybandwidthMMX?VectorAdvantagesEasytoget

highperformance;Noperations:areindependentusesamefunctionalunitaccessdisjointregistersaccessregistersinsameorderaspreviousinstructionsaccesscontiguousmemorywordsorknownpatterncanexploitlargememorybandwidthhidememorylatency(andanyotherlatency)Scalable(gethigherperformanceasmoreHWresourcesavailable)Compact:DescribeNoperationswith1shortinstruction(v.VLIW)Predictable(real-time)performancevs.statisticalperformance(cache)Multimediaready:chooseN*64b,2N*32b,4N*16b,8N*8bMature,developedcompilertechnologyVectorDisadvantage:OutofFashionVectorsAreInexpensiveScalarNopspercycle

2)circuitryHPPA-80004-wayissuereorderbuffer:

850Ktransistorsincl.6,7205-bitregisternumbercomparatorsVectorNopspercycle

2)circuitryT0vectormicro24opspercycle730Ktransistorstotalonly235-bitregisternumbercomparatorsNofloatingpointMIPSR10000vs.T0*SeeVectorsLowerPowerVectorOneinstructionfetch,decode,dispatchpervectorStructuredregisteraccesses

Smallercodeforhighperformance,lesspowerininstruct

溫馨提示

  • 1. 本站所有資源如無(wú)特殊說(shuō)明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請(qǐng)下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請(qǐng)聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶(hù)所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁(yè)內(nèi)容里面會(huì)有圖紙預(yù)覽,若沒(méi)有圖紙預(yù)覽就沒(méi)有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫(kù)網(wǎng)僅提供信息存儲(chǔ)空間,僅對(duì)用戶(hù)上傳內(nèi)容的表現(xiàn)方式做保護(hù)處理,對(duì)用戶(hù)上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對(duì)任何下載內(nèi)容負(fù)責(zé)。
  • 6. 下載文件中如有侵權(quán)或不適當(dāng)內(nèi)容,請(qǐng)與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準(zhǔn)確性、安全性和完整性, 同時(shí)也不承擔(dān)用戶(hù)因使用這些下載資源對(duì)自己和他人造成任何形式的傷害或損失。

最新文檔

評(píng)論

0/150

提交評(píng)論