版權(quán)說明:本文檔由用戶提供并上傳,收益歸屬內(nèi)容提供方,若內(nèi)容存在侵權(quán),請進行舉報或認領(lǐng)
文檔簡介
1、 David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,1,2008 Taiwan CUDA CourseProgramming Massively Parallel Processors:the CUDA experience Lecture 8: Application Case Study - Quantitative MRI Reconstruction, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,2,Ackno
2、wledgements,Sam S. Stone, Haoran Yi, Justin P. Haldar, Wen-mei W. Hwu, Bradley P. Sutton, Zhi-Pei Liang, Keith Thulburin*,Center for Reliable and High-Performance Computing, Beckman Institute for Advanced Science and Technology,Department of Electrical and Computer Engineering University of Illinois
3、 at Urbana-Champaign * University of Illinois, Chicago Medical Center, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,3,Overview,Magnetic resonance imaging Least-squares (LS) reconstruction algorithm Optimizing the LS reconstruction on the G80 Overcoming bottlenecks Performance
4、tuning Summary,1, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,4,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding,FFT,LS,2,Cartesian scan data + FFT: Slow scan, fast reconstruction, images may be poor, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30
5、- July 2, 2008,5,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding1,FFT,LS,Spiral scan data + Gridding + FFT: Fast scan, fast reconstruction, better images,2,1 Based on Fig 1 of Lustig et al, Fast Spiral Fourier Transform for Iterative MR Image Reconstruction, IEEE Intl Symp. on
6、 Biomedical Imaging, 2004, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,6,Reconstructing MR Images,Cartesian Scan Data,Spiral Scan Data,Gridding,FFT,Least-Squares (LS),Spiral scan data + LS Superior images at expense of significantly more computation,2, David Kirk/NVIDIA and W
7、en-mei W. Hwu Taiwan, June 30 - July 2, 2008,7,An Exciting Revolution - Sodium Map of the Brain,Images of sodium in the brain Requires powerful scanner (9.4 Tesla) Very large number of samples for increased SNR Requires high-quality reconstruction Enables study of brain-cell viability before anatomi
8、c changes occur in stroke and cancer treatment within days!,Courtesy of Keith Thulborn and Ian Atkinson, Center for MR Research, University of Illinois at Chicago, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,8,Least-Squares Reconstruction,Compute Q = FHF,Acquire Data,Compute
9、FHd,Find ,Q depends only on scanner configuration FHd depends on scan data found using linear solver Accelerate Q and FHd on G80 Q: 1-2 days on CPU FHd: 6-7 hours on CPU : 1.5 minutes on CPU,5, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,9,Algorithms to Accelerate,for (m = 0;
10、 m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,Compute Q,FHd is nearly identical Scan data M = # scan points kx, ky, kz = 3D scan data Pixel data N = # pixels x, y, z = input 3D pixel data Q = output pi
11、xel data Complexity is O(MN) Inner loop 10 FP MUL or ADD ops 2 FP trig ops 10 loads,6, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,10,From C to CUDA: Step 1What unit of work is assigned to each thread?,7,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n
12、+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) , David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,11,8,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += p
13、him*sin(exp) ,How does loop interchange help?,for (n = 0; n N; n+) for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?, David Kirk/NVIDIA and Wen-mei
14、 W. Hwu Taiwan, June 30 - July 2, 2008,12,9,for (n = 0; n N; n+) for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,How does loop fission help?,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhim for (n = 0; n N; n
15、+) for (m = 0; m M; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,13,10,for (m = 0; m M; m+) phim = rPhim*rPhim + iPhim*iPhi
16、m for (n = 0; n N; n+) for (m = 0; m M; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,From C to CUDA: Step 1What unit of work is assigned to each thread?,phi kernel Each thread computes phi at one scan point (each thread corresponds to one loop iteration),Q ker
17、nel Each thread computes Q at one pixel (each thread corresponds to one outer loop iteration), David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,14,Tiling of Scan Data,LS recon uses multiple grids Each grid operates on all pixels Each grid operates on a distinct subset of scan data
18、 Each thread in the same grid operates on a distinct pixel,for (m = 0; m M/32; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,12,Thread n operates on pixel n:, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,15,LS recon uses multiple grids Ea
19、ch grid operates on all pixels Each grid operates on a distinct subset of scan data Each thread in the same grid operates on a distinct pixel,Tiling of Scan Data,12,for (m = 31M/32; m 32M/32; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim*cos(exp) iQn += phim*sin(exp) ,Thread n operates on pi
20、xel n:, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,16,13,Q(float* x,y,z,rQ,iQ,kx,ky,kz,phi, int startM,endM) n = blockIdx.x*TPB + threadIdx.x for (m = startM; m endM; m+) exp = 2*PI*(kxm*xn + kym*yn + kzm*zn) rQn += phim * cos(exp) iQn += phim * sin(exp) ,From C to CUDA: Ste
21、p 2Where are the potential bottlenecks?,Bottlenecks Memory BW Trig ops Overheads (branches, addr calcs), David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,17,Step 3: Overcoming bottlenecks,14,LS recon on CPU (SP) Q: 45 hours, 0.5 GFLOPS FHd: 7 hours, 0.7 GFLOPS Counting each trig o
22、p as 1 FLOP, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,18,Step 3: Overcoming Bottlenecks (Mem BW),15,Register allocate pixel data Inputs (x, y, z); Outputs (rQ, iQ) Exploit temporal and spatial locality in access to scan data Constant memory + constant caches Shared memory,
23、 David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,19,Step 3: Overcoming Bottlenecks (Mem BW),Register allocation of pixel data Inputs (x, y, z); Outputs (rQ, iQ) FP arithmetic to off-chip loads: 2 to 1 Performance 5.1 GFLOPS (Q), 5.4 GFLOPS (FHd) Still bottlenecked on memory BW,16
24、, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,20,Step 3: Overcoming Bottlenecks (Mem BW),Old bottleneck: off-chip BW Solution: constant memory FP arithmetic to off-chip loads: 284 to 1 Performance 18.6 GFLOPS (Q), 22.8 GFLOPS (FHd) New bottleneck: trig operations,17, David Ki
25、rk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,21,Sidebar: Estimating Off-Chip Loads with Const Cache,How can we approximate the number of off-chip loads when using the constant caches? Given: 128 tpb, 4 blocks per SM, 256 scan points per grid Assume no evictions due to cache conflicts
26、7 accesses to global memory per thread (x, y, z, rQ x 2, iQ x 2) 4 blocks/SM * 128 threads/block * 7 accesses/thread = 3,584 global mem accesses 4 accesses to constant memory per scan point (kx, ky, kz, phi) 256 scan points * 4 loads/point = 1,024 constant mem accesses Total off-chip memory accesses
27、 = 3,584 + 1,024 = 4,608 Total FP arithmetic ops = 4 blocks/SM * 128 threads/block * 256 iters/thread * 10 ops/iter = 1,310,720 FP arithmetic to off-chip loads: 284 to 1,18, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,22,Step 3: Overcoming Bottlenecks (Trig),Old bottleneck: t
28、rig operations Solution: SFUs Performance 98.2 GFLOPS (Q), 92.2 GFLOPS (FHd) New bottleneck: overhead of branches and address calculations,19, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,23,Sidebar: Effects of Approximations,Avoid temptation to measure only absolute error (I0
29、 I) Can be deceptively large or small Metrics PSNR: Peak signal-to-noise ratio SNR: Signal-to-noise ratio Avoid temptation to consider only the error in the computed value Some apps are resistant to approximations; others are very sensitive,20,A.N. Netravali and B.G. Haskell, Digital Pictures: Repre
30、sentation, Compression, and Standards (2nd Ed), Plenum Press, New York, NY (1995)., David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,24,Step 3: Overcoming Bottlenecks (Overheads),Old bottleneck: Overhead of branches and address calculations Solution: Loop unrolling and experimenta
31、l tuning Performance 179 GFLOPS (Q), 145 GFLOPS (FHd),21, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,25,Experimental Tuning: Tradeoffs,In the Q kernel, three parameters are natural candidates for experimental tuning Loop unrolling factor (1, 2, 4, 8, 16) Number of threads pe
32、r block (32, 64, 128, 256, 512) Number of scan points per grid (32, 64, 128, 256, 512, 1024, 2048) Cant optimize these parameters independently Resource sharing among threads (register file, shared memory) Optimizations that increase a threads performance often increase the threads resource consumpt
33、ion, reducing the total number of threads that execute in parallel Optimization space is not linear Threads are assigned to SMs in large thread blocks Causes discontinuity and non-linearity in the optimization space,22, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,26,Experimen
34、tal Tuning: Example,Increase in per-thread performance, but fewer threads: Lower overall performance,23, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,27,Experimental Tuning: Scan Points Per Grid,24, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,28,Sidebar
35、: Cache-Conscious Data Layout,kx, ky, kz, and phi components of same scan point have spatial and temporal locality Prefetching Caching Old layout does not fully leverage that locality New layout does fully leverage that locality,25, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008
36、,29,Experimental Tuning: Scan Points Per Grid (Improved Data Layout),26, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,30,Experimental Tuning: Loop Unrolling Factor,27, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,31,Sidebar: Optimizing the CPU Implementa
37、tion,Optimizing the CPU implementation of your application is very important Often, the transformations that increase performance on CPU also increase performance on GPU (and vice-versa) The research community wont take your results seriously if your baseline is crippled Useful optimizations Data ti
38、ling SIMD vectorization (SSE) Fast math libraries (AMD, Intel) Classical optimizations (loop unrolling, etc) Intel compiler (icc, icpc),28, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,32,Summary of Results,29,8X, David Kirk/NVIDIA and Wen-mei W. Hwu Taiwan, June 30 - July 2, 2008,33,Summ
溫馨提示
- 1. 本站所有資源如無特殊說明,都需要本地電腦安裝OFFICE2007和PDF閱讀器。圖紙軟件為CAD,CAXA,PROE,UG,SolidWorks等.壓縮文件請下載最新的WinRAR軟件解壓。
- 2. 本站的文檔不包含任何第三方提供的附件圖紙等,如果需要附件,請聯(lián)系上傳者。文件的所有權(quán)益歸上傳用戶所有。
- 3. 本站RAR壓縮包中若帶圖紙,網(wǎng)頁內(nèi)容里面會有圖紙預(yù)覽,若沒有圖紙預(yù)覽就沒有圖紙。
- 4. 未經(jīng)權(quán)益所有人同意不得將文件中的內(nèi)容挪作商業(yè)或盈利用途。
- 5. 人人文庫網(wǎng)僅提供信息存儲空間,僅對用戶上傳內(nèi)容的表現(xiàn)方式做保護處理,對用戶上傳分享的文檔內(nèi)容本身不做任何修改或編輯,并不能對任何下載內(nèi)容負責。
- 6. 下載文件中如有侵權(quán)或不適當內(nèi)容,請與我們聯(lián)系,我們立即糾正。
- 7. 本站不保證下載資源的準確性、安全性和完整性, 同時也不承擔用戶因使用這些下載資源對自己和他人造成任何形式的傷害或損失。
最新文檔
- 2026年金堂檢察招聘備考題庫及答案詳解(易錯題)
- 河北吳橋雜技藝術(shù)學(xué)校2026年度高層次人才選聘的備考題庫參考答案詳解
- 2026年上海市楊浦區(qū)復(fù)旦大學(xué)經(jīng)濟學(xué)院招聘經(jīng)濟學(xué)院專業(yè)碩士研究生教育行政管理崗位的備考題庫及答案詳解參考
- 成都大學(xué)附屬醫(yī)院2025年公開考核招聘高層次人才備考題庫及答案詳解參考
- 2026年重慶人才服務(wù)股份有限公司派遣至重慶輪船(集團)有限公司交運游輪分公司招聘備考題庫附答案詳解
- 2026年校園招聘深圳市羅外教育集團校園招聘備考題庫及參考答案詳解一套
- 2026年中國雄安集團有限公司校園招聘備考題庫及答案詳解(新)
- 2026年德州一地事業(yè)單位招聘備考題庫及完整答案詳解1套
- 江蘇省沿海開發(fā)集團有限公司所屬企業(yè)2025年度長期招聘備考題庫及一套參考答案詳解
- 2026年重慶機床(集團)有限責任公司磐聯(lián)傳動科技分公司招聘6人備考題庫及一套參考答案詳解
- 2025侵襲性肺真菌病診斷與治療指南解讀課件
- 服裝店入股協(xié)議合同
- 化工單位電氣安全管理制度(3篇)
- 離退休工作培訓(xùn)指南
- 新教版小學(xué)三年級科學(xué)上冊期末試卷
- 四年級上冊道德與法治期末試卷及答案
- 村文書考試題及答案2025
- 游泳教練資格證(社會指導(dǎo)游泳與公共理論)考試題庫及答案
- 近年中考真題《出師表》36套
- 青光眼病的課件
- 【《1000噸年產(chǎn)量的鼠李糖脂生產(chǎn)工藝設(shè)計》9600字(論文)】
評論
0/150
提交評論