CUDA: Massively Parallel Programming
Yangdong Deng, Institute of Microelectronics, Tsinghua University

Outline
- From GPGPU to CUDA
- CUDA parallel program organization
- The parallel execution model
- CUDA basics
- CUDA memory
- CUDA development tools
- Program optimization

Graphics Processing Unit (GPU)
- A dedicated graphics device for personal computers, workstations, and game consoles
- Discrete graphics cards: nVidia and ATI (now AMD) are the main manufacturers; Intel planned to enter this market with Larrabee
- Motherboard-integrated graphics: Intel

The 3D Graphics Pipeline
[Figure: the CPU feeds the GPU pipeline, with texture memory and the framebuffer attached to the GPU]

Real-Time 3D Graphics Processing
- A typical frame: 1M triangles, 3M vertices, 25M fragments
- At 30 frames/s: 30M triangles/s, 90M vertices/s, 750M fragments/s

Traditional GPU Architecture
[Figure: graphics program -> vertex processors -> fragment processors -> pixel operations -> output image]

The GPU's Computational Power
- Data-level parallelism: uniform computation with dedicated memory channels
- Memory latency is hidden effectively

General Purpose Computing on GPU (GPGPU)
- Core idea: describe a general-purpose problem in a graphics language, and map the data onto vertex or fragment processors
- But: hardware resources are underutilized; memory access patterns are severely restricted; debugging is difficult; a high level of graphics-programming skill is required

NVidia G200 Architecture
[Figure]

CUDA: Compute Unified Device Architecture
- A general-purpose parallel computing model
- Single-instruction, multiple-data (SIMD) execution: all threads run the same code (1000s of threads on the fly), and massive parallel compute resources process different data
- Memory latency is hidden by raising the computation-to-communication ratio, by coalescing accesses to adjacent memory addresses, and by fast thread switching (about 1 cycle on the GPU vs. about 1000 cycles on a CPU)

The Hybrid Computing Model
- CUDA: an integrated CPU + GPU C application
- The CPU executes the serial code
- The GPU is a massively data-parallel coprocessor that executes large batches of fine-grained threads
[Figure: CPU serial code alternates with GPU parallel code (kernel 0, kernel 1); concurrent execution]

CUDA Success Stories
[Figure]

CUDA Performance
- BLAS3 (basic linear algebra, matrix-matrix): 127 GFLOPS
- FFT: 52 GFLOPS on benchFFT
- FDTD (computational electrodynamics): 1.2 Gcells/sec
- SSEARCH (Smith-Waterman sequence comparison): 5.2 Gcells/sec
- Black-Scholes (option pricing): 4.7 GOptions/sec
- VMD (molecular dynamics visualization): 290 GFLOPS

Sparse Matrix Vector Product (SMVP)
[Figures: problem instances for SMVP; SMVP throughput on GTX280]
SMVP application: static timing analysis. Adapted from Ramalingam, A., et al. "An Accurate Sparse Matrix Based Framework for Statistical Static Timing Analysis." ICCAD 2006.
[Figure: static timing analysis results on GTX280]

Outline (next: CUDA parallel program organization)

Dimensions of Parallelism
- 1D: y = a + b (y, a, b are vectors)
- 2D: P = M × N (P, M, N are matrices)
- 3D: CT or MRI imaging

Parallel Thread Organization
- Thread: the basic unit of parallelism
- Thread block: a group of cooperating threads, the Cooperative Thread Array (CTA); threads in a block may synchronize with one another and exchange data through fast shared memory; organized in 1, 2, or 3 dimensions; at most 512 threads per block
- Grid: a set of thread blocks, organized in 1 or 2 dimensions, sharing global memory
- Kernel: the core program executed on the GPU; one kernel runs as one grid

Parallel Program Organization in CUDA
[Figure: software-to-hardware mapping; thread -> SP, thread block -> SM, grid -> GPU]

Parallel Thread Execution
- Calling a kernel function requires an execution configuration
- Threads and blocks have IDs (threadIdx: 1D, 2D, or 3D; blockIdx: 1D or 2D), which determine the data each thread processes

    __global__ void kernel(...);
    dim3 DimGrid(3, 2);    // 6 thread blocks
    dim3 DimBlock(16, 16); // 256 threads per block
    kernel<<<DimGrid, DimBlock>>>(...);

Example 1: Element-Wise Addition

    // CPU program: sum of two vectors a and b
    void add_cpu(float *a, float *b, int N)
    {
        for (int idx = 0; idx < N; idx++)
            a[idx] += b[idx];
    }

    void main()
    {
        ...
        add_cpu(a, b, N);
    }

    // CUDA program: sum of two vectors a and b
    __global__ void add_gpu(float *a, float *b, int N)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        if (idx < N)
            a[idx] += b[idx];
    }

    void main()
    {
        ...
        dim3 dimBlock(blocksize);
        dim3 dimGrid(ceil(N / (float)blocksize));
        add_gpu<<<dimGrid, dimBlock>>>(a, b, N);
    }

Outline (next: the parallel execution model)

CUDA Processing Flow
[Figure]

Parallel Thread Execution
- Within an SM, threads execute in parallel in units of 32 threads called warps
- All threads of a warp execute the same instruction
- The half-warp is the basic unit of memory operations

GPU Workload Distribution
- The global block scheduler manages thread-block-level parallelism
- It receives the thread organization from the CPU and assigns thread blocks to SMs according to the hardware's resources

Streaming Multiprocessor (SM)
[Figure]

A Streaming Multiprocessor Executes Thread Blocks
- Threads are assigned to an SM at thread-block granularity
- Depending on resource demands, one SM holds up to 8 blocks
- An SM in the G80 can accept up to 768 threads: 256 threads/block × 3 blocks, or 128 threads/block × 6 blocks, etc.
- Threads run concurrently: the SM assigns and maintains thread IDs, and manages and schedules the threads
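These limits are those of the G80 generation; later GPUs raise them. The actual limits of the installed device can be queried at run time. A minimal sketch (added here, not from the original deck) using the standard runtime call cudaGetDeviceProperties:

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   // properties of device 0
        printf("max threads per block:   %d\n",  prop.maxThreadsPerBlock);
        printf("warp size:               %d\n",  prop.warpSize);
        printf("multiprocessors (SMs):   %d\n",  prop.multiProcessorCount);
        printf("shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        return 0;
    }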
Thread Life Cycle
- A grid is launched on the GPU
- Thread blocks are assigned to SMs in order; in general every SM should have at least 1 thread block
- Each SM organizes its threads into warps, then schedules and executes the warps that are ready
- When warps and thread blocks finish execution, their resources are released and the GPU continues distributing thread blocks

Example of Hiding Memory Latency
- G80: executing one instruction across all threads of a warp takes 4 clock cycles
- Assume 1 global memory access per 8 instructions, and a 400-cycle global memory latency
- How many warps are needed to tolerate the latency?
- 400 cycles / 8 instructions = 50 cycles to hide per instruction on average; 50 cycles / 4 cycles per warp instruction = 12.5, so 13 warps keep an SM busy

Arithmetic Instruction Throughput
- 4 clock cycles: single-precision floating-point add, multiply, and multiply-add; integer add; 24-bit integer multiplication; bitwise operations, compare, min, max, and type-conversion instructions
- Note: 1. a warp is issued over 4 cycles; 2. arithmetic ops are pipelined; 3. it is still possible to have 8 ops ready in each cycle
- 16 clock cycles: reciprocal, reciprocal square root, 32-bit integer multiplication
- Other functions are combinations of the above:
      y / x = rcp(x) * y        // 20 cycles per warp
      sqrt(x) = rcp(rsqrt(x))   // 32 cycles per warp
- Integer division and modulo operations are costly!

Floating-Point Precision
- GPUs before the GT200: single precision only, per the IEEE-754 floating-point standard
- The GT200 adds double-precision support: the SPs still handle only single precision, each SM gains one double-precision unit, and double precision runs 8-12× slower than single precision

Control Flow
- Branch statements within one warp may follow different instruction paths
- Threads on different instruction paths can only execute sequentially, one possible path of the warp at a time: N instruction paths yield 1/N throughput
- Only divergence within a single warp matters; different warps taking different instruction paths are independent of one another
- The G80 uses instruction predication to accelerate execution

Control Flow (continued)
- Common case: a branch condition that is a function of the thread ID easily causes divergence
- Example with divergence: if (threadIdx.x > 2) {...} creates two instruction paths within a thread block, with a branch granularity smaller than the warp size
- Example without divergence: if (threadIdx.x / WARP_SIZE > 2) {...} also creates two instruction paths within the thread block, but the branch granularity is a whole multiple of the warp size, so all threads of any given warp share the same instruction path

Thread Synchronization
- void __syncthreads();
- Barrier synchronization across all threads of a thread block
- Avoids RAW/WAR/WAW hazards when accessing shared memory

    __shared__ float scratch[256];
    scratch[threadID] = begin[threadID];
    __syncthreads();   // wait here until every thread arrives, then continue
    int left = scratch[threadID - 1];

Dead-Lock with __syncthreads
- Deadlock occurs if some threads have val larger than threshold and others do not: the threads that return early never reach the barrier

    __global__ void compute(...)
    {
        // do some computation for val
        if (val > threshold)
            return;
        __syncthreads();
        // work with val
    }
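A common way out, shown below as a minimal sketch with hypothetical names (compute_safe, data, threshold are illustrative, not from the deck): keep every thread alive until the barrier, and predicate only the work that follows on the per-thread value.

    // Hedged sketch: every thread reaches __syncthreads(); only the
    // work after the barrier is conditional on the per-thread value.
    __global__ void compute_safe(float *data, float threshold)
    {
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        float val = data[idx];              // some computation for val
        bool active = (val <= threshold);   // remember the branch outcome
        __syncthreads();                    // uniform: no thread returned early
        if (active) {
            data[idx] = val * 2.0f;         // work with val (placeholder)
        }
    }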
Outline (next: CUDA basics)

CUDA Language Extensions
- Declaration specifiers: __global__, __device__, __shared__, __local__, __constant__
- Keywords: threadIdx, blockIdx, blockDim, gridDim
- Intrinsics: __syncthreads
- Runtime API: memory, symbol, and execution management; function launch

    __device__ float filter[N];

    __global__ void convolve(float *image)
    {
        __shared__ float region[M];
        ...
        region[threadIdx] = image[i];
        __syncthreads();
        ...
        image[j] = result;
    }

    // Allocate GPU memory
    void *myimage = cudaMalloc(bytes);

    // 100 blocks, 10 threads per block
    foo<<<100, 10>>>(parameters);

Memory Spaces
- R/W per-thread registers: 1-cycle latency
- R/W per-thread local memory: slow, since registers spill to global memory
- R/W per-block shared memory: 1-cycle latency, but bank conflicts may drag performance down
- R/W per-grid global memory: ~500-cycle latency, but coalesced accesses can hide the latency
- Read-only per-grid constant and texture memories: ~500-cycle latency, but cached

Allocating GPU Global Memory
- cudaMalloc(): allocates global memory in device memory; two parameters, the object array pointer and the array size
- cudaFree(): releases global memory in device memory, given the array pointer

    int blk_sz = 64;
    float *Md;
    int size = blk_sz * blk_sz * sizeof(float);
    cudaMalloc((void**)&Md, size);
    ...
    cudaFree(Md);

Host-Device Data Exchange
- cudaMemcpy(): memory data transfer; requires four parameters: pointer to destination, pointer to source, number of bytes copied, and type of transfer (host to host, host to device, device to host, or device to device)

    cudaMemcpy(Md, M.elements, size, cudaMemcpyHostToDevice);
    cudaMemcpy(M.elements, Md, size, cudaMemcpyDeviceToHost);

New Variable Qualifiers in CUDA
- __device__: lives in the GPU's global memory space; has the lifetime of the application; accessible by all threads of the grid, and by CPU code through runtime functions
- __constant__: lives in the GPU's constant memory space; has the lifetime of the application; accessible by all threads of the grid, and by CPU code through runtime functions
- __shared__: lives in a thread block's shared memory on the GPU; has the lifetime of the thread block; accessible only by threads within that block
- Local variables: live in the SM's registers and local memory; have the lifetime of the thread; private to each thread

CUDA Function Definitions
- __global__ defines a kernel function; it must return void
- __device__ functions: their address cannot be taken with the & operator; no recursion; no static variables; no variable-length argument lists

CUDA Math Functions
- pow, sqrt, cbrt, hypot, exp, exp2, expm1, log, log2, log10, log1p, sin, cos, tan, asin, acos, atan, atan2, sinh, cosh, tanh, asinh, acosh, atanh, ceil, floor, trunc, round, etc.
- Scalar operands only
- Many functions have a faster, less precise device counterpart, prefixed with "__", e.g. __sinf()
- The compiler switch -use_fast_math forces generation of object code for the fast versions
- Each multiprocessor contains two transcendental-function units
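A small kernel sketch contrasting the two variants (added for illustration; the kernel name wave and its buffers are hypothetical), assuming the single-precision fast intrinsics __sinf and __expf:

    // Precise vs. fast math in one kernel; out[i] holds the accuracy gap.
    __global__ void wave(float *out, const float *in, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            float precise = sinf(in[i]) * expf(-in[i]);     // full-precision library calls
            float fast    = __sinf(in[i]) * __expf(-in[i]); // fast intrinsics, fewer cycles
            out[i] = precise - fast;
        }
    }

With -use_fast_math, the compiler rewrites the precise calls into the intrinsic forms automatically.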
Example 2: Matrix Multiplication
- A matrix data type (not part of CUDA itself): single-precision floats, width × height elements, held in elements as a 1-D array in row-major order

    typedef struct {
        int width;
        int height;
        float *elements;
    } Matrix;

Example 2: Matrix Multiplication (scheme)
- C = A × B, all of size WIDTH × WIDTH
- One thread computes one matrix element
- Simplification: assume WIDTH × WIDTH ≤ 512, so a single thread block suffices
- Each thread loads one row of A and one column of B, and performs one multiply and one add per pair of corresponding elements

CUDA Implementation: Host Side

    // Matrix multiplication on the device
    void Mul(const Matrix A, const Matrix B, Matrix C)
    {
        int size = A.width * A.width * sizeof(float);

        // Load A and B to the device
        float *Ad, *Bd, *Cd;
        cudaMalloc((void**)&Ad, size);
        cudaMemcpy(Ad, A.elements, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Bd, size);
        cudaMemcpy(Bd, B.elements, size, cudaMemcpyHostToDevice);
        cudaMalloc((void**)&Cd, size);

        // Launch the device computation threads!
        dim3 dimGrid(1);
        dim3 dimBlock(A.width, A.width);
        Muld<<<dimGrid, dimBlock>>>(Ad, Bd, Cd, A.width);

        // Read C from the device
        cudaMemcpy(C.elements, Cd, size, cudaMemcpyDeviceToHost);

        // Free device matrices
        cudaFree(Ad); cudaFree(Bd); cudaFree(Cd);
    }

CUDA Implementation: Kernel

    // Matrix multiplication kernel - thread specification
    __global__ void Muld(float *Ad, float *Bd, float *Cd, int width)
    {
        // 2D thread ID
        int tx = threadIdx.x;
        int ty = threadIdx.y;

        // cvalue stores the matrix element computed by this thread
        float cvalue = 0;
        for (int k = 0; k < width; ++k) {
            float ae = Ad[ty * width + k];
            float be = Bd[tx + k * width];
            cvalue += ae * be;
        }

        // Write the matrix to device memory; each thread writes one element
        Cd[ty * width + tx] = cvalue;
    }

Outline (next: CUDA memory - shared memory, global memory)

Shared Memory
- Located inside each streaming multiprocessor
- Shared by all threads of one thread block
- Entirely software-controlled
- Accessing an address takes only 1 clock cycle

Shared Memory Organization
- The G80's shared memory is organized as 16 banks, addressed in 4-byte words: bank ID = (4-byte address) % 16, so adjacent 4-byte addresses map to adjacent banks
- Each bank has a bandwidth of 4 bytes per clock cycle
- Simultaneous accesses to the same bank cause a bank conflict and can only be processed sequentially
- Conflicts arise only among threads of the same thread block

Bank Addressing Examples
- No bank conflicts: linear addressing with stride 1 (s = 1), or a random 1:1 permutation

    __shared__ float shared[256];
    float foo = shared[threadIdx.x];

- 2-way bank conflicts: linear addressing with stride 2 (s = 2)

    __shared__ float shared[256];
    float foo = shared[2 * threadIdx.x];

- 8-way bank conflicts: linear addressing with stride 8 (s = 8)

    __shared__ float shared[256];
    float foo = shared[8 * threadIdx.x];

A Common Bank-Conflict Pattern
- A 2D float array in shared memory, e.g. 16×16 elements
- One thread processes one matrix row, looping over its 16 elements
- The threads of the block then access one column simultaneously (e.g. column 1, in purple): 16-way bank conflicts
[Figure: bank indices without padding; thread t15 highlighted]

Solutions
- Option 1: pad the rows, adding one extra element at the end of each row (see the sketch below)
- Option 2: transpose before processing; the transpose itself suffers bank conflicts, but may save them later
[Figures: bank indices with padding; transpose]
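A minimal sketch of option 1 for the 16×16 case above (the kernel name rowsum and its buffers are hypothetical, added for illustration): 16 threads each walk one row, and the one-element row padding spreads every simultaneously accessed column across all 16 banks.

    // Launch with <<<1, 16>>>: thread t owns row t of a 16x16 tile.
    __global__ void rowsum(float *out, const float *in)
    {
        __shared__ float tile[16][17];   // 17, not 16: the pad shifts row t
                                         // so tile[t][c] lands in bank (t+c)%16
        int t = threadIdx.x;
        for (int c = 0; c < 16; ++c)
            tile[t][c] = in[t * 16 + c];
        __syncthreads();                 // not strictly needed here (each thread
                                         // reads its own row), but usual for shared tiles
        float sum = 0.0f;
        for (int c = 0; c < 16; ++c)     // all 16 threads touch column c together;
            sum += tile[t][c];           // with stride 16 this would be a 16-way conflict
        out[t] = sum;
    }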
Outline (next: CUDA memory - global memory)

Global Memory
- Global memory is not cached on the G80; only constant memory and texture memory have small caches
- Access latency: 400-600 clock cycles
- Very easily becomes the performance bottleneck, so optimizing it is the key to performance!

Coalesced Global Memory Accesses
- Accesses to global memory are coordinated at half-warp granularity
- The half-warp should access a contiguous global memory region: 64 bytes (each thread reads a word: int, float, ...), 128 bytes (each thread reads a double-word: int2, float2, ...), or 256 bytes (each thread reads a quad-word: int4, float4, ...)
- Additional restrictions: the starting address of the region must be a multiple of the region's data-type size, and the k-th thread of the half-warp must access the k-th address
- Exception: some intermediate threads may sit out (predicated access, divergence within a warp)
[Figures: coalesced and non-coalesced global memory access patterns]

Outline (next: CUDA development tools)

Downloading the CUDA Software
- /object/cuda_get_cn.html
- CUDA driver: the hardware driver
- CUDA toolkit: the tool suite
- CUDA SDK: sample programs and dynamic libraries
- CUDA Visual Profiler: the profiling tool

Software Environment
- After the GPU hardware and CUDA software are installed:
[Figure: Application -> CUDA Libraries (CUFFT & CUBLAS) -> CUDA Runtime Libraries -> CUDA Driver, spanning the CPU (host) and the GPU (device)]

Compiling a CUDA Program
- CUDA source files are processed by nvcc; nvcc is a compiler driver
- nvcc outputs:
  - PTX (Parallel Thread eXecution): a virtual ISA spanning multiple GPU generations, just-in-time compiled by the CUDA runtime
  - GPU binary: a device-specific binary object
  - Standard C code with explicit parallelism

Debugging
- make dbg=1: the CPU code is compiled in debug mode, so a debugger (e.g. gdb, Visual Studio) can be used, but intermediate results of the GPU code cannot be inspected
- make emu=1: runs sequentially on the CPU in emulation mode; printf() can print intermediate results
  - Execution is essentially sequential, so races between threads will not reproduce
  - Floating-point results may differ slightly

Checking Resource Usage
- Compile with the -cubin flag and inspect the "code" section of the .cubin file:

    architecture {sm_10}
    abiversion {0}
    modname {cubin}
    code {
        name = BlackScholesGPU
        lmem = 0     // per-thread local memory
        smem = 68    // per-thread-block shared memory
        reg = 20     // per-thread registers
        bar = 0
        bincode {
            0xa0004205 0x04200780 0x40024c09 0x00200780
            ...
        }
    }

Outline (next: program optimization)

Optimizing Algorithms for the GPU
- Maximize independent parallelism
- Maximize arithmetic intensity (math/bandwidth); recomputing data is often cheaper than fetching it, since the GPU spends its transistors on ALUs, not memory
- Compute on the GPU as much as possible to avoid transferring data to and from the CPU; even low-parallelism computation often beats frequent CPU-GPU transfers

Exploiting Shared Memory
- Hundreds of times faster than global memory
- Threads cooperate through shared memory
- Use one thread or a few threads to load and compute data shared by all threads of the block
- Use it to avoid non-coalesced access: stage loads and stores in shared memory to re-order non-coalesceable addressing

Effective Parallelism
- Partition the computation so the load is balanced across the GPU's SMs: many threads, many thread blocks
- Reduce resource usage (registers, shared memory) so that multiple thread blocks can run on each SM

Optimizing Memory Access Coherence
- Global memory latency (400-600 clock cycles) frequently becomes the bottleneck
- Coalesced vs. non-coalesced access makes an order-of-magnitude difference
- Experiment: a kernel reads a float, increments it, and writes it back; 3M floats (12MB); times averaged over 10K runs with 12K blocks × 256 threads:
  - 356 μs: coalesced
  - 357 μs: coalesced, some threads don't participate
  - 3,494 μs: permuted/misaligned thread access
- Use texture memory to exploit spatial locality

Uncoalesced float3 Code

    __global__ void accessFloat3(float3 *d_in, float3 *d_out)
    {
        int index = blockIdx.x * blockDim.x + threadIdx.x;
        float3 a = d_in[index];
        a.x += 2;
        a.y += 2;
        a.z += 2;
        d_out[index] = a;
    }

- A float3 occupies 12 bytes, so float3 f = d_in[threadIdx.x]; makes each thread execute 3 reads: sizeof(float3) is not 4, 8, or 16, and a half-warp ends up reading three 64B non-contiguous regions

Coalescing float3 Access
- A 3-step approach (256 threads/block), staging through shared memory
[Figure: global memory -> shared memory -> processing -> shared memory -> global memory]
- Use shared memory to allow coalescing: a thread block needs sizeof(float3) × 256 bytes of SMEM
- Each thread reads 3 scalar floats at offsets 0, (threads/block), and 2 × (threads/block); these will likely be processed by other threads, so synchronize
- Processing: each thread retrieves its float3 from the SMEM array by casting the SMEM pointer to (float3*) and using the thread ID as the index; the rest of the compute code does not change!
[Figure: the coalesced float3 access code]

Matrix Transpose
- The SDK sample ("transpose") illustrates coalescing through shared memory; the optimization shows a clear benefit even on small data

Uncoalesced Transpose

    __global__ void transpose_naive(float *odata, float *idata, int width, int height)
    {
        unsigned int xIndex = blockDim.x * blockIdx.x + threadIdx.x;
        unsigned int yIndex = blockDim.y * blockIdx.y + threadIdx.y;
        if (xIndex < width && yIndex < height) {
            unsigned int index_in  = xIndex + width * yIndex;
            unsigned int index_out = yIndex + height * xIndex;
            odata[index_out] = idata[index_in];
        }
    }

[Figures: read/write patterns of the naive transpose over a width × height matrix]

Coalesced Transpose
- Assumption: the matrix has been decomposed into square tiles
- Thread block (bx, by): reads the (bx, by) input tile, stores it into SMEM, then writes the SMEM data to the (by, bx) output tile
- Thread (tx, ty): reads element (tx, ty) from the input tile and writes element (tx, ty) into the output tile
- Coalescing is achieved if the block/tile dimensions are multiples of 16
[Figures: coalesced transpose access patterns]
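The code for this last slide did not survive extraction. Below is a hedged reconstruction of the approach just described, not the verbatim SDK sample: a 16×16 tile is staged in shared memory so that both the global read and the global write are coalesced, with one column of padding so that reading the tile back by columns avoids bank conflicts.

    #define BLOCK_DIM 16

    // Sketch: block (bx,by) reads input tile (bx,by) and writes output tile (by,bx).
    __global__ void transpose(float *odata, const float *idata, int width, int height)
    {
        __shared__ float tile[BLOCK_DIM][BLOCK_DIM + 1];  // +1 pad avoids bank conflicts

        // Coalesced read: adjacent threads read adjacent idata addresses.
        unsigned int x = blockIdx.x * BLOCK_DIM + threadIdx.x;
        unsigned int y = blockIdx.y * BLOCK_DIM + threadIdx.y;
        if (x < width && y < height)
            tile[threadIdx.y][threadIdx.x] = idata[y * width + x];

        __syncthreads();

        // Coalesced write: the tile is re-read by columns, and adjacent
        // threads write adjacent odata addresses in the transposed tile.
        x = blockIdx.y * BLOCK_DIM + threadIdx.x;
        y = blockIdx.x * BLOCK_DIM + threadIdx.y;
        if (x < height && y < width)
            odata[y * height + x] = tile[threadIdx.x][threadIdx.y];
    }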
