現(xiàn)代GPU上的高效控制流與異常流機制綜述_第1頁
現(xiàn)代GPU上的高效控制流與異常流機制綜述_第2頁
現(xiàn)代GPU上的高效控制流與異常流機制綜述_第3頁
現(xiàn)代GPU上的高效控制流與異常流機制綜述_第4頁
現(xiàn)代GPU上的高效控制流與異常流機制綜述_第5頁
已閱讀5頁,還剩29頁未讀, 繼續(xù)免費閱讀

下載本文檔

版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)

文檔簡介

ASurveyofEfficientControlFlowAndExceptionalFlowSupportOnModernGPU沈陽Dec,2012WhatisaGPU?GPU=GraphicsProcessingUnitOptimizedforHighlyParallelWorkloadsHighlyProgrammableNvidia’sGTX580:16x32-widemultiprocessors512ALUs24,576concurrentthreads2WhyGPU?Floating-pointOperationsperSecondfortheGPUandCPU3WhyGPU?GPUdevotesmoretransistorstodataprocessingWell-suitedfordataparallelcomputationsMorecomputationpower/Watt4HowDoesaGPULookLike?StreamingMultiprocessor(nextslide)NvidiaFermi’sfloorplan:16SMspositionedaroundL2cache.EachSMisaverticalrectangularstripthatcontainanorangeportion(scheduleranddispatch),agreenportion(executionunits),andlightblueportions(registerL1cache)5StreamingMultiprocessorCUDAcore:actuallyexecutionunitcallitSIMDlaneforaccuracy512CUDACores16SMs32Cores/SM6ProgrammingModelKernel:ApieceofprogramrunsonGPUDonotconfusewithoperatingsystemconceptExample:__global__voiddaxpy(intn,doublea,double*x,double*y){inti=blockIdx.x*blockDim.x+threadIdx.x;if(i<n)y[i]=a*x[i]+y[i];}main(){…//allocateandinitializememory//InvokeparallelSAXPYkernelwith256threads/blockintnblocks=(n+255)/256;

daxpy<<<nblocks,256>>>(n,2.0,x,y);…//transferresultsfromGPUtoCPU}78Non-GraphicsAPI:CUDA,OpenCL,DirectComputeProgrammingModel:Hierarchyofscalarthreadsdaxpy<<<nblocks,256>>>(n,2.0,x,y)Wholeloopiscalledgrid,decomposedintonblockblocks,eachblockcontains256threadsHardwaregroupscalarthreadsintowarps,runinlockstepGridBlocksBlocksThreadBlocksProgrammingModelScalarThread123456789101112ScalarThreadWarp123456789101112Single-Instruction-MultiThreadSIMTExecutionModelProgrammersseesMIMDthreads(scalar)GPUHWbundlesthreadsintowarpsandrunstheminlocksteponSIMDhardwareTimeAT1T2T3T4BT1T2T3T4CT1T2DT3T4ET1T2T3T4A:v=foo[tid.x];B:if(v<10)C:v=0;elseD:v=10;E:w=bar[tid.x]+v;foo[]={4,8,12,16};9threadwarpbranchdivergence10Re-convergenceStackA:K=A[tid.x];B:if(K>10)C:K=10;elseD:K=0;E:B=C[tid.x]+K;TimeBranchDivergenceDE0011CE1100B-1111PCRPCActiveMaskE-1111A1234B1234C12----D----34E123450%SIMDEfficiency!ReconvergenceStackEA[]={4,8,12,16};Insomecases:SIMDEfficiency

20%GPUMicroarchitectureOverview

(10,000feet)Single-Instruction,Multiple-ThreadsGPUInterconnectionNetworkSIMTCoreClusterSIMTCoreSIMTCoreMemoryPartitionGDDR3/GDDR5MemoryPartitionGDDR3/GDDR5MemoryPartitionGDDR3/GDDR5Off-chip

DRAMSIMTCoreClusterSIMTCoreSIMTCoreSIMTCoreClusterSIMTCoreSIMTCore1112GPUMicroarchitectureInterconnectionNetworkMemoryPartitionLast-LevelCacheBankOff-ChipDRAMChannelMemoryPartitionLast-LevelCacheBankOff-ChipDRAMChannelMemoryPartitionLast-LevelCacheBankOff-ChipDRAMChannelSIMTCoreSIMTCoreSIMTCoreSIMTCoreSIMTCoreSIMTFrontEndSIMDDatapathFetchDecodeScheduleBranchDone(WarpID)MemorySubsystemIcnt.NetworkSMemL1D$Tex$Const$13DynamicWarpFormation:KeyIdeaIdea:FormnewwarpatdivergenceEnoughthreadsbranchingtoeachpathtocreatefullnewwarps14DynamicWarpFormation:ExampleAABBGGAACCDDEEFFTimeAABBGGAACDEEFTimeAx/1111y/1111Bx/1110y/0011Cx/1000y/0010Dx/0110y/0001Fx/0001y/1100Ex/1110y/0011Gx/1111y/1111AnewwarpcreatedfromscalarthreadsofbothWarpxandyexecutingatBasicBlockDDExecutionofWarpxatBasicBlockAExecutionofWarpyatBasicBlockALegendAABaselineDynamicWarpFormation15ThreadSchedulerPC-WarpLUTWarpPoolIssue

LogicWarpAllocatorTIDxNPCATIDxNPCBHHTIDxNPCPrioTIDxNPCPrioOCCPCIDXOCCPCIDXWarpUpdateRegisterTWarpUpdateRegisterNTREQREQTIDxNPCPrioA5678A1234DynamicWarpFormation:

HardwareImplementation5786BC10110100B230110B0B5238B0010B271342BC01101001C11001C14C61101C1NoLaneConflictA:BEQR2,BC:…X1234Y5678X1234X1234X1234X1234Y5678Y5678Y5678Y5678Z5238Z5238Z5238CoalescedMemoryAccessNotpreservedbyDWFDWF

Pathologies

Sensitivetoschedulingpolicy169634D--10----D1234E5678E9101112ETime1278C5--1112C9634D--10----D1278E5--1112E9634E--10----EB:if(K>10)C:K=10;elseD:K=0;E:B=C[tid.x]+K;1000scycles16E:B=C[tid.x]+K;#Acc=3#Acc=

91234E5678E9101112E0x1000x1400x18012

7

12E9

6

3

8E510

11

4EMemoryMemory0x1000x1400x180NoDWFWithDWF17ThreadBlockCompactionRunathreadblocklikeawarpWholeblockmovebetweencoherent/divergentcodeBlock-widestacktotrackexec.pathsreconvg.Barrier@Branch/reconvergept.Allavail.threadsarriveatbranchInsensitivetowarpschedulingWarpcompactionRegroupingwithallavail.threadsIfnodivergence,givesstaticwarparrangementStarvationExtraUncoalescedMemoryAccess18ThreadBlockCompactionPCRPCActiveThreadsA-123456789101112DE----34--6----910----CE12----5--78----1112E-123456789101112Time1278C5--1112C9634D--10----D5678A9101112A1234A5678E9101112E1234EA:K=A[tid.x];B:if(K>10)C:K=10;elseD:K=0;E:B=C[tid.x]+K;5678A9101112A1234A5--78C----1112C12----C--6----D910----D----34D5678E9101112E1278E------------------------------------------------CAPRIBenefitFromCompaction?Non-divergentbranch,noneedtocompactDivergentbranch,yetcompaction-ineffectiveIdea:likebranchprediction,weusepredictionforcompaction1920TBCCAPRILegendPathXy:y-thinstructioninbasicblockXWarpWn:WarpwithWarp-IDnCAPRICAPRI21SimultaneousBranchInterweavingDWF,TBCimprovesSIMDlaneutilizationbyformingnewwarpsexecutingthesameinstructionatdivergenceAnotherapproach:Simultaneouslyexecuteinstructionsfromdifferentbranches22SimultaneousBranchInterweaving23SimultaneousBranchInterweavingcascadescheduler24SimultaneousWarpInterweavingSBIimprovesthroughputwhenbranchpathsarebalancednobenefitwhentheworkloadofeachthreadofawarpisunbalanced.Example:ifblockswithnoelsecounterpartsSolution:Simultaneouslyexecuteinstructionsfromdifferentwarpsifnoconflictsinactivitymask25SimultaneousWarpInterweaving26ExceptionSupport:MotivationDWF,TBC,CAPRI,SBI&SWIallimprovescontrolflowefficiencyGood,butonlyefficiencygaininmachineProgrammersstillneedtoreasonaboutavailablememoryAnothertypeofcontrolflowthatimprovesprogrammer’sefficiency:exceptionEnablesmanyfeatures:demandpaging,contextswitch,executeprogramspartiallyresidentinmemory,…27WhynotjustborrowCPUideas?CPUsusebufferingtopreservearch.stateFuturefile,Historyfile,Re-orderBuffer…ButGPUshave1000xasmanyregistersNotpractical!28ExceptionSupportDoweneedtoalwaysrestartatthepointwhereexceptionoccurs?Keyidea:savestateatthepointwherelivestateissmall.Ifexceptionoccurs,restorethestatethereandreplaytheprogramatthatpointKeyEnabler:Idempotentregions,restartableregionsproducingthesameeffect29ExceptionSupport30Example:ContextSwitchMustsaveandrestorearchitecturestate

But...GPUshavemegabytesofregisterstateSaveonlylivestateSavestateatpointsofminimallivestateABB#liveregisters23Candidatecutpoint942B#liveregisters2Exceptionhandler31SourceCodeCom

溫馨提示

  • 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
  • 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
  • 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
  • 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
  • 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
  • 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
  • 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。

評論

0/150

提交評論