Parallel Computing Courseware

Uploaded by: 我*** | IP location: Beijing | Uploaded: 2024-01-17 | Format: DOCX | Pages: 57 | Size: 2.15 MB


Document summary

GPU Architecture in Detail and …
Bin ZHOU

Outline
- General guideline
- Kepler in …

General Optimization Strategies: measurement
- Find out the limiting factor in kernel performance:
  - Memory bandwidth bound (memory …)
  - Instruction throughput bound
  - Latency bound
- Measure effective memory/instruction throughput.
- Compare to the peak GB/s and the peak inst/s.

Memory optimization
- When: the code is memory-bound and effective memory throughput is much lower than the peak.
- Purpose: access only data that are absolutely necessary.
- Major techniques: …; reduce redundant access: read-only cache, shared memory.

Instruction optimization
- When: you find out the code is instruction bound.
  - A compute-intensive algorithm can easily become memory-bound if not careful.
  - Typically, worry about instruction optimization after memory and execution configuration optimizations.
- Purpose: use fewer instructions to get the same job done.
- Major techniques:
  - Use high-throughput instructions (ex. wider …).
  - Reduce wasted instructions: branch divergence, reduce replay (conflict), …

Latency optimization
- When: the code is latency bound. Both the memory and instruction throughputs are far from the peak.
- Latency hiding: switching threads. A thread blocks when one of the operands isn't ready.
- Purpose: have enough warps to hide latency.
- Major techniques: increase active warps, increase …

Minimize CPU-GPU data transfer
- Host<->device data transfer has much lower bandwidth than global memory access: 16 GB/s (PCIe x16 Gen3) vs 250 GB/s & 3.95 Tinst/s.
- Sometimes it's even better to recompute on the GPU.
- Move CPU code to the GPU even when it gains no performance there, if doing so can reduce data transfer.
- Group transfers: one large transfer is much better than many small ones.
- Overlap memory transfer with computation.
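The "find out the limiting factor" step of the general strategy comes down to comparing measured throughput against the two peaks. The sketch below is one possible reading of that step, assuming the peak figures quoted in the slides (250 GB/s and 3.95 Tinst/s). The 60% "near peak" threshold, the `Limiter` enum, and the `classify` function are hypothetical; in practice the byte and instruction counts would come from a profiler.

```cpp
// Possible limiting factors named in the slides.
enum class Limiter { MemoryBound, InstructionBound, LatencyBound };

// Peak figures quoted in the slides.
constexpr double kPeakBytesPerSec = 250e9;
constexpr double kPeakInstPerSec = 3.95e12;
// ASSUMPTION: below this fraction of peak a throughput counts as "far from the peak".
constexpr double kNearPeakFraction = 0.6;

// Classify a kernel from its measured traffic, instruction count, and run time.
Limiter classify(double bytes_moved, double instructions, double seconds) {
    const double mem_fraction = (bytes_moved / seconds) / kPeakBytesPerSec;
    const double inst_fraction = (instructions / seconds) / kPeakInstPerSec;
    // Both throughputs far from peak: neither unit is saturated, so latency dominates.
    if (mem_fraction < kNearPeakFraction && inst_fraction < kNearPeakFraction) {
        return Limiter::LatencyBound;
    }
    // Otherwise the unit running closer to its peak is the limiter. This is the
    // same as comparing the kernel's inst/byte ratio with the hardware's ratio.
    return mem_fraction >= inst_fraction ? Limiter::MemoryBound
                                         : Limiter::InstructionBound;
}
```

For example, a kernel that moves 200 GB and issues 1e12 instructions in one second runs at 80% of peak bandwidth but about 25% of peak instruction rate, so this sketch labels it memory-bound.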
Revisit GPU processing
1. Copy input data from CPU memory to GPU memory.
2. Load GPU code and execute it.
3. Copy results from GPU memory to CPU memory.

Without overlap the three steps run back to back:
total time = copy-in time + kernel time + copy-out time

Streams
Surviving fragments of the slide code (the rest was lost in extraction):

    cudaMemcpyAsync(dst2, src2, size, …);
    kernel<<<grid, block, 0, …>>>(…);
    cudaMemcpyAsync(dst1, src1, size, cudaMemcpyDeviceToHost, …);

Kepler in …
- NVIDIA …: 1.31 TFLOPS double precision, 3.95 TFLOPS single precision, 250 GB/s memory bandwidth, 2,688 functional units
- #1 on Top500 in …
- NVIDIA GK110 - …

Kepler GK110 SMX vs Fermi …
- 3x …
- Power goes …

New ISA encoding: 255 registers per thread
- Fermi limit: 63 registers per thread. A common Fermi performance limiter; leads to excessive spilling.
- Kepler: up to 255 registers per thread. Especially helpful for FP64 …

Hyper-Q
- Feature of Kepler K20 GPUs to increase application throughput by enabling work to be scheduled onto the GPU in parallel.
- Two ways to take advantage:
  - CUDA Streams: now they really are concurrent.
  - CUDA Proxy for MPI: concurrent CUDA MPI processes on one GPU.

[Figure: Kepler grid management, with stream queue, pending & suspended grids, and work …]

Fermi concurrency
- Streams such as A<<<>>>; B<<<>>>; … and P<<<>>>; Q<<<>>>; … and X<<<>>>; Y<<<>>>; … feed one hardware work queue.
- Fermi allows 16-way concurrency: up to 16 grids can run at once.
- But CUDA streams multiplex into a single queue.
- Overlap only at stream edges.

Kepler improved concurrency
- The same streams feed multiple hardware work queues.
- Kepler allows 32-way concurrency.
- One work queue per stream.
- Concurrency at full-stream level.
- No inter-stream dependencies.

[Figure: Hyper-Q, CPU processes A to F sharing one GPU simultaneously; CUDA client-server software]

What is Dynamic Parallelism?
The ability to launch new kernels from the GPU:
- Dynamically: based on run-time data
- Simultaneously: from multiple threads at once
- Independently: each thread can launch a different grid
Fermi: only the CPU can generate GPU work. Kepler: the GPU can generate work for itself.

What does it mean?
- From GPU as co-processor to autonomous, dynamic parallelism.

New types of algorithms
- Recursive parallel algorithms like Quicksort
- Adaptive mesh algorithms like …
- Computational power allocated to regions of interest
- CUDA on …

Familiar programming model
Kernel B, launched from the CPU alongside A and C, itself launches X, Y and Z:

    __global__ void B(float *data)
    {
        …
        X<<<...>>>(…);
        Y<<<...>>>(…);
        Z<<<...>>>(…);
        …
    }

    int main()
    {
        float *data;
        …
        A<<<...>>>(…);
        B<<<...>>>(…);
        C<<<...>>>(…);
        …
        return 0;
    }

Code example

    __device__ float buf[…];

    __global__ void cnp(float *data)
    {
        int tid = threadIdx.x;
        if (tid % 2)
            buf[tid/2] = …;
        …
        if (tid == 0) {
            launch<<<128, 256>>>(buf);
            cudaDeviceSynchronize();
        }
        __syncthreads();
        if (tid == 0) {
            cudaMemcpyAsync(data, buf, …);
        }
    }

- Launch is per-thread and asynchronous.
- CUDA primitives are per-block: launched kernels and CUDA objects like streams are visible to all threads in a thread block, but cannot be passed to a child kernel.
- Sync includes all launches by any thread in the block.
- cudaDeviceSynchronize() does not imply __syncthreads().

Memory consistency
- Launch implies … (child sees parent state at time of launch).
- Sync implies … (parent sees child writes after sync).
- Local & shared memory are private; constants are immutable.

NVIDIA GPUDirect: now supports …
- Provides technology necessary to enable lower-latency memory transfers between the GPU and other PCIe devices without requiring custom hardware.
- API and documentation for device driver …
- Available on Linux
- Supported on Kepler Quadro and Tesla …

More threads are …
- 2-3x throughput per clock per …
- Memory bandwidth …
- Bigger SMs have bigger …

If you already launched enough threads, the following enhancements on Kepler will ensure enough active warps on SMs:
- 2x register file on each SM. E.g. 63 registers per thread, blockDim …:
  - In Fermi: 16 active warps
  - In Kepler: 32 active warps
- 2x simultaneous blocks per SM. E.g. 16 registers per thread, blockDim 96:
  - In Fermi: 96*8/32 = 24 active warps
  - In Kepler: 96*16/32 = 48 active warps

If one kernel can't launch enough threads
- Concurrent kernels: GK110 allows up to 32 concurrent kernels to …
- Hyper-Q
- Using …
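The active-warp figures quoted for Fermi and Kepler (16 vs 32 warps at 63 registers per thread, and 96*8/32 vs 96*16/32 at 16 registers per thread) can be checked with a small host-only calculation. This is a sketch under stated assumptions: the per-SM limits below (32K vs 64K registers, 8 vs 16 resident blocks, 48 vs 64 resident warps) are the commonly published Fermi and GK110 figures rather than values stated in this text, the names are hypothetical, and the model ignores register allocation granularity and shared memory.

```cpp
#include <algorithm>

// Per-SM resource limits. ASSUMPTION: commonly published values, not from the slides.
struct SmLimits {
    int registers;   // 32-bit registers per SM
    int max_blocks;  // resident thread blocks per SM
    int max_warps;   // resident warps per SM
};

constexpr SmLimits kFermi{32768, 8, 48};
constexpr SmLimits kKepler{65536, 16, 64};
constexpr int kWarpSize = 32;

// Active warps per SM for a kernel with the given register use and block size.
int active_warps(const SmLimits& sm, int regs_per_thread, int block_dim) {
    const int warps_per_block = (block_dim + kWarpSize - 1) / kWarpSize;
    const int regs_per_block = regs_per_thread * warps_per_block * kWarpSize;
    const int blocks_by_regs = sm.registers / regs_per_block;
    const int blocks_by_warps = sm.max_warps / warps_per_block;
    // The tightest of the three limits decides how many blocks are resident.
    const int blocks = std::min({blocks_by_regs, blocks_by_warps, sm.max_blocks});
    return blocks * warps_per_block;
}
```

With 16 registers per thread and 96-thread blocks, the block limit is the binding one, giving 24 warps on Fermi and 48 on Kepler. With 63 registers per thread the register file is the binding limit; assuming 256-thread blocks (the block size for that example did not survive in the text), the function returns 16 warps on Fermi and 32 on Kepler, which matches the quoted figures.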
