版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
ASurveyofEfficientControlFlowAndExceptionalFlowSupportOnModernGPU沈陽Dec,2012WhatisaGPU?GPU=GraphicsProcessingUnitOptimizedforHighlyParallelWorkloadsHighlyProgrammableNvidia’sGTX580:16x32-widemultiprocessors512ALUs24,576concurrentthreads2WhyGPU?Floating-pointOperationsperSecondfortheGPUandCPU3WhyGPU?GPUdevotesmoretransistorstodataprocessingWell-suitedfordataparallelcomputationsMorecomputationpower/Watt4HowDoesaGPULookLike?StreamingMultiprocessor(nextslide)NvidiaFermi’sfloorplan:16SMspositionedaroundL2cache.EachSMisaverticalrectangularstripthatcontainanorangeportion(scheduleranddispatch),agreenportion(executionunits),andlightblueportions(registerL1cache)5StreamingMultiprocessorCUDAcore:actuallyexecutionunitcallitSIMDlaneforaccuracy512CUDACores16SMs32Cores/SM6ProgrammingModelKernel:ApieceofprogramrunsonGPUDonotconfusewithoperatingsystemconceptExample:__global__voiddaxpy(intn,doublea,double*x,double*y){inti=blockIdx.x*blockDim.x+threadIdx.x;if(i<n)y[i]=a*x[i]+y[i];}main(){…//allocateandinitializememory//InvokeparallelSAXPYkernelwith256threads/blockintnblocks=(n+255)/256;
daxpy<<<nblocks,256>>>(n,2.0,x,y);…//transferresultsfromGPUtoCPU}78Non-GraphicsAPI:CUDA,OpenCL,DirectComputeProgrammingModel:Hierarchyofscalarthreadsdaxpy<<<nblocks,256>>>(n,2.0,x,y)Wholeloopiscalledgrid,decomposedintonblockblocks,eachblockcontains256threadsHardwaregroupscalarthreadsintowarps,runinlockstepGridBlocksBlocksThreadBlocksProgrammingModelScalarThread123456789101112ScalarThreadWarp123456789101112Single-Instruction-MultiThreadSIMTExecutionModelProgrammersseesMIMDthreads(scalar)GPUHWbundlesthreadsintowarpsandrunstheminlocksteponSIMDhardwareTimeAT1T2T3T4BT1T2T3T4CT1T2DT3T4ET1T2T3T4A:v=foo[tid.x];B:if(v<10)C:v=0;elseD:v=10;E:w=bar[tid.x]+v;foo[]={4,8,12,16};9threadwarpbranchdivergence10Re-convergenceStackA:K=A[tid.x];B:if(K>10)C:K=10;elseD:K=0;E:B=C[tid.x]+K;TimeBranchDivergenceDE0011CE1100B-1111PCRPCActiveMaskE-1111A1234B1234C12----D----34E123450%SIMDEfficiency!ReconvergenceStackEA[]={4,8,12,16};Insomecases:SIMDEfficiency
20%GPUMicroarchitectureOverview
(10,000feet)Single-Instruction,Multiple-ThreadsGPUInterconnectionNetworkSIMTCoreClusterSIMTCoreSIMTCoreMemoryPartitionGDDR3/GDDR5MemoryPartitionGDDR3/GDDR5MemoryPartitionGDDR3/GDDR5Off-chip
DRAMSIMTCoreClusterSIMTCoreSIMTCoreSIMTCoreClusterSIMTCoreSIMTCore1112GPUMicroarchitectureInterconnectionNetworkMemoryPartitionLast-LevelCacheBankOff-ChipDRAMChannelMemoryPartitionLast-LevelCacheBankOff-ChipDRAMChannelMemoryPartitionLast-LevelCacheBankOff-ChipDRAMChannelSIMTCoreSIMTCoreSIMTCoreSIMTCoreSIMTCoreSIMTFrontEndSIMDDatapathFetchDecodeScheduleBranchDone(WarpID)MemorySubsystemIcnt.NetworkSMemL1D$Tex$Const$13DynamicWarpFormation:KeyIdeaIdea:FormnewwarpatdivergenceEnoughthreadsbranchingtoeachpathtocreatefullnewwarps14DynamicWarpFormation:ExampleAABBGGAACCDDEEFFTimeAABBGGAACDEEFTimeAx/1111y/1111Bx/1110y/0011Cx/1000y/0010Dx/0110y/0001Fx/0001y/1100Ex/1110y/0011Gx/1111y/1111AnewwarpcreatedfromscalarthreadsofbothWarpxandyexecutingatBasicBlockDDExecutionofWarpxatBasicBlockAExecutionofWarpyatBasicBlockALegendAABaselineDynamicWarpFormation15ThreadSchedulerPC-WarpLUTWarpPoolIssue
LogicWarpAllocatorTIDxNPCATIDxNPCBHHTIDxNPCPrioTIDxNPCPrioOCCPCIDXOCCPCIDXWarpUpdateRegisterTWarpUpdateRegisterNTREQREQTIDxNPCPrioA5678A1234DynamicWarpFormation:
HardwareImplementation5786BC10110100B230110B0B5238B0010B271342BC01101001C11001C14C61101C1NoLaneConflictA:BEQR2,BC:…X1234Y5678X1234X1234X1234X1234Y5678Y5678Y5678Y5678Z5238Z5238Z5238CoalescedMemoryAccessNotpreservedbyDWFDWF
Pathologies
Sensitivetoschedulingpolicy169634D--10----D1234E5678E9101112ETime1278C5--1112C9634D--10----D1278E5--1112E9634E--10----EB:if(K>10)C:K=10;elseD:K=0;E:B=C[tid.x]+K;1000scycles16E:B=C[tid.x]+K;#Acc=3#Acc=
91234E5678E9101112E0x1000x1400x18012
7
12E9
6
3
8E510
11
4EMemoryMemory0x1000x1400x180NoDWFWithDWF17ThreadBlockCompactionRunathreadblocklikeawarpWholeblockmovebetweencoherent/divergentcodeBlock-widestacktotrackexec.pathsreconvg.Barrier@Branch/reconvergept.Allavail.threadsarriveatbranchInsensitivetowarpschedulingWarpcompactionRegroupingwithallavail.threadsIfnodivergence,givesstaticwarparrangementStarvationExtraUncoalescedMemoryAccess18ThreadBlockCompactionPCRPCActiveThreadsA-123456789101112DE----34--6----910----CE12----5--78----1112E-123456789101112Time1278C5--1112C9634D--10----D5678A9101112A1234A5678E9101112E1234EA:K=A[tid.x];B:if(K>10)C:K=10;elseD:K=0;E:B=C[tid.x]+K;5678A9101112A1234A5--78C----1112C12----C--6----D910----D----34D5678E9101112E1278E------------------------------------------------CAPRIBenefitFromCompaction?Non-divergentbranch,noneedtocompactDivergentbranch,yetcompaction-ineffectiveIdea:likebranchprediction,weusepredictionforcompaction1920TBCCAPRILegendPathXy:y-thinstructioninbasicblockXWarpWn:WarpwithWarp-IDnCAPRICAPRI21SimultaneousBranchInterweavingDWF,TBCimprovesSIMDlaneutilizationbyformingnewwarpsexecutingthesameinstructionatdivergenceAnotherapproach:Simultaneouslyexecuteinstructionsfromdifferentbranches22SimultaneousBranchInterweaving23SimultaneousBranchInterweavingcascadescheduler24SimultaneousWarpInterweavingSBIimprovesthroughputwhenbranchpathsarebalancednobenefitwhentheworkloadofeachthreadofawarpisunbalanced.Example:ifblockswithnoelsecounterpartsSolution:Simultaneouslyexecuteinstructionsfromdifferentwarpsifnoconflictsinactivitymask25SimultaneousWarpInterweaving26ExceptionSupport:MotivationDWF,TBC,CAPRI,SBI&SWIallimprovescontrolflowefficiencyGood,butonlyefficiencygaininmachineProgrammersstillneedtoreasonaboutavailablememoryAnothertypeofcontrolflowthatimprovesprogrammer’sefficiency:exceptionEnablesmanyfeatures:demandpaging,contextswitch,executeprogramspartiallyresidentinmemory,…27WhynotjustborrowCPUideas?CPUsusebufferingtopreservearch.stateFuturefile,Historyfile,Re-orderBuffer…ButGPUshave1000xasmanyregistersNotpractical!28ExceptionSupportDoweneedtoalwaysrestartatthepointwhereexceptionoccurs?Keyidea:savestateatthepointwherelivestateissmall.Ifexceptionoccurs,restorethestatethereandreplaytheprogramatthatpointKeyEnabler:Idempotentregions,restartableregionsproducingthesameeffect29ExceptionSupport30Example:ContextSwitchMustsaveandrestorearchitecturestate
But...GPUshavemegabytesofregisterstateSaveonlylivestateSavestateatpointsofminimallivestateABB#liveregisters23Candidatecutpoint942B#liveregisters2Exceptionhandler31SourceCodeCom
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預覽,若沒有圖紙預覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
- 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 學生實習就業(yè)合作協(xié)議書
- 個體化麻醉方案在神經(jīng)血流動力學管理中應用
- 2026年餐飲業(yè)供應鏈管理部副經(jīng)理面試問題與答案解析
- 個體化飲食處方醫(yī)養(yǎng)管理模式
- 2026年經(jīng)濟政策分析專家面試指南及答案解析
- 耗材更換服務協(xié)議書
- 部編版四年級下冊語文《習作我的“自畫像”》教材教案(2025-2026學年)
- 脊柱滑脫診斷和治療教案(2025-2026學年)
- 小學語文二年級夕陽真美之二教案
- 六年級英語上冊UnitAFairyTale教案湘少版湘少版小學六年級上冊英語教案(2025-2026學年)
- 落葉清掃壓縮機設計答辯
- 廣東省建筑裝飾裝修工程質(zhì)量評價標準
- 珍愛生命活在當下-高一上學期生命教育主題班會課件
- 湖北省武漢市洪山區(qū)2023-2024學年八年級上學期期末數(shù)學試題
- 應用寫作-終結(jié)性考核-國開(SC)-參考資料
- 胸痛中心出院病人隨訪制度
- 場地租憑轉(zhuǎn)讓合同協(xié)議書
- 口腔科科室建設規(guī)劃
- 動物活體成像技術(shù)
- 新教科版科學四年級上冊分組實驗報告單
- 雷達截面與隱身技術(shù)課件
評論
0/150
提交評論