GPU Architecture in Detail and …
Bin ZHOU

Outline
- General guideline
- Kepler in …

General Optimization Strategies
- Find out the limiting factor in kernel performance, for example whether the kernel is memory bandwidth bound.
- Measure effective memory/instruction throughput.
- Compare the measured instruction throughput to peak inst/s and the measured memory throughput to peak GB/s.

Memory Optimization
- Applies if the code is memory-bound and effective memory throughput is much lower than the peak.
- Purpose: access only data that are absolutely necessary.
- Major techniques: reduce redundant access (read-only cache, shared memory).

Instruction Optimization
- Applies if you find out the code is instruction bound.
- A compute-intensive algorithm can easily become memory-bound if you are not careful.
- Typically, worry about instruction optimization after memory and execution configuration optimizations.
- Purpose: use fewer instructions to get the same job done.
- Major techniques: use high-throughput instructions (ex. wider …); reduce wasted instructions (branch divergence, replay caused by conflicts, …).

Latency Optimization
- Applies when the code is latency bound: both the memory and instruction throughputs are far from the peak.
- Latency hiding: switching threads. A thread blocks when one of its operands isn't ready.
- Purpose: have enough warps to hide latency.
- Major techniques: increase active warps, increase …

CPU-GPU …: Minimize CPU-GPU Data Transfer
- Host<->device data transfer has much lower bandwidth than global memory access: 16 GB/s (PCIe x16 Gen3) vs 250 GB/s & 3.95 Tinst/s.
- Sometimes it's even better to recompute on the GPU.
- Move CPU code to the GPU even when it brings no performance gain by itself, if doing so reduces data transfer.
- Group transfers: one large transfer is much better than many small ones.
- Overlap memory transfer with computation.

Revisit GPU Processing …
- Copy input data from CPU memory to GPU memory.
- Load GPU code and execute it.
- Copy results from GPU memory to CPU memory.
- Without overlap, total time = copy-in time + execution time + copy-out time.

Streams
The slide shows copies and kernels queued on separate streams. The surviving code fragments are:

    cudaMemcpyAsync(dst1, src1, size, cudaMemcpyDeviceToHost, …);
    kernel<<<grid, block, 0, …>>>(…);
    cudaMemcpyAsync(dst2, src2, size, …);

Kepler in …
- NVIDIA …: 1.31 TFLOPS double, 3.95 TFLOPS single, 250 GB/s memory bandwidth, 2,688 functional units.
- #1 on Top500 in …

NVIDIA GK110
- Kepler GK110 SMX vs Fermi …: 3x …; power goes …

New ISA Encoding: 255 Registers per Thread
- Fermi limit: 63 registers per thread. This is a common Fermi performance limiter and leads to excessive spilling.
- Kepler: up to 255 registers per thread. Especially helpful for FP64 …

Hyper-Q
- Feature of Kepler K20 GPUs to increase application throughput by enabling work to be scheduled onto the GPU in parallel.
- Two ways to take advantage of it:
  - CUDA Streams: now they really are concurrent.
  - CUDA Proxy for MPI: concurrent CUDA MPI processes on one GPU.

[Figure: Kepler grid management, showing the work distributor, pending and suspended grids, and stream queues]

Fermi Concurrency
- Stream 1: A<<<>>>; B<<<>>>; … Stream 2: P<<<>>>; Q<<<>>>; … Stream 3: X<<<>>>; Y<<<>>>; … All feed one hardware work queue.
- Fermi allows 16-way concurrency:
  - Up to 16 grids can run at once.
  - But CUDA streams multiplex into a single queue.
  - Overlap only at stream edges.

Kepler Concurrency
- Stream 1: A<<<>>>; B<<<>>>; … Stream 2: P<<<>>>; Q<<<>>>; … Stream 3: X<<<>>>; Y<<<>>>; … Each stream feeds its own queue (multiple hardware work queues).
- Kepler allows 32-way concurrency:
  - One work queue per stream.
  - Concurrency at full-stream level.
  - No inter-stream dependencies.

[Figure: Hyper-Q, simultaneous CPU processes A to F sharing one GPU through CUDA client-server software]

What Is Dynamic Parallelism?
- The ability to launch new kernels from the GPU:
  - Dynamically: based on run-time data.
  - Simultaneously: from multiple threads at once.
  - Independently: each thread can launch a different grid.
- Fermi: only the CPU can generate GPU work.
- Kepler: the GPU can generate work for itself.

What Does It Mean?
- GPU as co-processor vs autonomous, dynamic parallelism.

New Types of Algorithms
- Recursive parallel algorithms like quicksort.
- Adaptive mesh algorithms like …
- Computational … allocated to regions of …

Familiar Programming Model

    __global__ void B(float …)
    {
        …
        X<<<...>>>(…);
        Y<<<...>>>(…);
        Z<<<...>>>(…);
        …
    }

    int main()
    {
        float *data;
        …
        A<<<...>>>(…);
        B<<<...>>>(…);
        C<<<...>>>(…);
        …
        return 0;
    }

Code Example

    __device__ float …;

    __global__ void cnp(float …)
    {
        int tid = threadIdx.x;
        if (tid % 2)
            buf[tid / 2] = …;
        …
        if (tid == 0) {
            launch<<<128, 256>>>(buf);
            cudaDeviceSynchronize();
        }
        __syncthreads();
        if (tid == 0) {
            cudaMemcpyAsync(data, buf, …);
        }
    }

Notes on the example:
- Launch is per-thread and …
- CUDA primitives are per-block: launched kernels and CUDA objects like streams are visible to all threads in a thread block, but cannot be passed to a child kernel.
- Sync includes all launches by any thread in the block.
- cudaDeviceSynchronize() does not imply __syncthreads().

Memory Consistency
- Launch implies … (child sees parent state at time of launch).
- Sync implies … (parent sees child writes after sync).
- Local & shared memory …
- Constants are …

NVIDIA GPUDirect
- Provides the technology necessary to enable lower-latency memory transfers between the GPU and other PCIe devices without requiring custom hardware.
- API and documentation for device driver …
- Available on Linux.
- Supported on Kepler Quadro and Tesla.
- NVIDIA GPUDirect now supports …

More Threads Are …
- 2-3x throughput per clock per …
- Memory bandwidth …
- Bigger SMs have bigger …
- If you already launched enough threads, the following enhancements on Kepler will ensure enough active warps on the SMs:
  - 2x register file on each …
    - E.g. 63 registers per thread, blockDim …
    - In Fermi: 16 active warps.
    - In Kepler: 32 active warps.
  - 2x simultaneous blocks per …
    - E.g. 16 registers per thread, blockDim …
    - In Fermi: 96*8/32 = 24 active warps.
    - In Kepler: 96*16/32 = 48 active warps.
- If one kernel can't launch enough threads:
  - Concurrent kernels: GK110 allows up to 32 concurrent kernels to …
  - Hyper-Q …
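The active-warp figures in the two Kepler examples can be checked by hand. The following is a rough sketch under stated assumptions, none of which are spelled out in the slides: 32 threads per warp, 32,768 32-bit registers per Fermi SM and 65,536 per Kepler SMX, resident-block limits of 8 (Fermi) and 16 (Kepler), blocks of 96 threads (inferred from the 96*8/32 arithmetic), and no register allocation granularity. The symbols are illustrative only.

```latex
\begin{align*}
% Register-limited case: 63 registers per thread
W^{\mathrm{reg}}_{\mathrm{Fermi}}  &= \left\lfloor \frac{32768}{63 \times 32} \right\rfloor = \lfloor 16.25 \rfloor = 16 \\
W^{\mathrm{reg}}_{\mathrm{Kepler}} &= \left\lfloor \frac{65536}{63 \times 32} \right\rfloor = \lfloor 32.51 \rfloor = 32 \\
% Block-limited case: 16 registers per thread, 96 threads per block
W^{\mathrm{blk}}_{\mathrm{Fermi}}  &= \frac{96 \times 8}{32} = 24 \\
W^{\mathrm{blk}}_{\mathrm{Kepler}} &= \frac{96 \times 16}{32} = 48
\end{align*}
```

In both cases the Kepler count is double the Fermi count, which is where the doubled register file and the doubled block limit show up.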
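The advice to overlap memory transfer with computation, together with the `cudaMemcpyAsync` and `kernel<<<grid, block, 0, ...>>>` fragments in the streams slide, can be sketched as follows. This is a minimal illustration, not the deck's own code: the kernel `scale`, the helper `run_overlapped`, and the chunking scheme are hypothetical names introduced here, and most error handling is omitted for brevity.

```cuda
#include <cuda_runtime.h>

// Hypothetical kernel (not from the slides): doubles each element in place.
__global__ void scale(float *d, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

// Hypothetical helper: splits n elements into nStreams chunks and queues each
// chunk's copy-in, kernel and copy-out on its own stream, so that one chunk's
// transfers can overlap another chunk's computation.
// Returns true if every element came back doubled.
bool run_overlapped(int n, int nStreams)
{
    const int chunk = (n + nStreams - 1) / nStreams;
    const size_t bytes = (size_t)n * sizeof(float);
    float *h = nullptr, *d = nullptr;

    // Pinned host memory is needed for the copies to run asynchronously.
    if (cudaMallocHost((void **)&h, bytes) != cudaSuccess) return false;
    if (cudaMalloc((void **)&d, bytes) != cudaSuccess) { cudaFreeHost(h); return false; }
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    cudaStream_t *streams = new cudaStream_t[nStreams];
    for (int s = 0; s < nStreams; ++s) cudaStreamCreate(&streams[s]);

    for (int s = 0; s < nStreams; ++s) {
        const int off = s * chunk;
        const int len = (off + chunk <= n) ? chunk : n - off;
        if (len <= 0) break;
        const size_t cb = (size_t)len * sizeof(float);
        // Operations within one stream run in order; different streams may overlap.
        cudaMemcpyAsync(d + off, h + off, cb, cudaMemcpyHostToDevice, streams[s]);
        scale<<<(len + 255) / 256, 256, 0, streams[s]>>>(d + off, len);
        cudaMemcpyAsync(h + off, d + off, cb, cudaMemcpyDeviceToHost, streams[s]);
    }

    // Wait for all streams, then verify the results on the host.
    bool ok = (cudaDeviceSynchronize() == cudaSuccess);
    for (int i = 0; ok && i < n; ++i) ok = (h[i] == 2.0f * (float)i);

    for (int s = 0; s < nStreams; ++s) cudaStreamDestroy(streams[s]);
    delete[] streams;
    cudaFree(d);
    cudaFreeHost(h);
    return ok;
}
```

A call such as `run_overlapped(1 << 20, 4)` queues four independent copy/compute/copy pipelines. How much they actually overlap depends on the device's copy engines and on the hardware work queues discussed in the Fermi and Kepler concurrency slides.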