Parallel Programming in C with MPI and OpenMP
Michael J. Quinn
Chapter 17: Shared-memory Programming

Outline
- OpenMP
- Shared-memory model
- Parallel for loops
- Declaring private variables
- Critical sections
- Reductions
- Performance improvements
- More general data parallelism
- Functional parallelism

OpenMP
- OpenMP: an application programming interface (API) for parallel programming on multiprocessors
  - Compiler directives
  - Library of support functions
- OpenMP works in conjunction with Fortran, C, or C++

What's OpenMP Good For?
- C + OpenMP sufficient to program multiprocessors
- C + MPI + OpenMP a good way to program multicomputers built out of multiprocessors
  - IBM RS/6000 SP
  - Fujitsu AP3000
  - Dell High Performance Computing Cluster

Shared-memory Model
- Processors interact and synchronize with each other through shared variables.

Fork/Join Parallelism
- Initially only the master thread is active
- Master thread executes sequential code
- Fork: master thread creates or awakens additional threads to execute parallel code
- Join: at end of parallel code, created threads die or are suspended

Fork/Join Parallelism (figure)

Shared-memory Model vs. Message-passing Model (#1)
- Shared-memory model
  - Number of active threads is 1 at start and finish of program; changes dynamically during execution
- Message-passing model
  - All processes active throughout execution of program

Incremental Parallelization
- A sequential program is a special case of a shared-memory parallel program
- A parallel shared-memory program may have only a single parallel loop
- Incremental parallelization: the process of converting a sequential program to a parallel program a little bit at a time

Shared-memory Model vs. Message-passing Model (#2)
- Shared-memory model
  - Execute and profile sequential program
  - Incrementally make it parallel
  - Stop when further effort not warranted
- Message-passing model
  - Sequential-to-parallel transformation requires major effort
  - Transformation done in one giant step rather than many tiny steps

Parallel for Loops
- C programs often express data-parallel operations as for loops:
      for (i = first; i < size; i += prime)
         marked[i] = 1;
- OpenMP makes it easy to indicate when the iterations of a loop may execute in parallel
- Compiler takes care of generating code that forks/joins threads and allocates the iterations to threads

Pragmas
- Pragma: a compiler directive in C or C++
- Stands for "pragmatic information"
- A way for the programmer to communicate with the compiler
- Compiler is free to ignore pragmas
- Syntax: #pragma omp <rest of pragma>

Parallel for Pragma
- Format:
      #pragma omp parallel for
      for (i = 0; i < N; i++)
         a[i] = b[i] + c[i];
- Compiler must be able to verify that the run-time system will have the information it needs to schedule loop iterations
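For concreteness, here is a minimal complete program built around the parallel for pragma; this sketch is not from the original slides, and the array size N is an arbitrary choice.

    #include <stdio.h>

    #define N 1000

    int main (void)
    {
       double a[N], b[N], c[N];
       int i;

       for (i = 0; i < N; i++) {        /* sequential initialization */
          b[i] = i;
          c[i] = 2.0 * i;
       }

       #pragma omp parallel for         /* fork: iterations divided among threads */
       for (i = 0; i < N; i++)
          a[i] = b[i] + c[i];
                                        /* join: threads synchronize here */
       printf ("a[N-1] = %f\n", a[N-1]);   /* 3*(N-1) = 2997.0 */
       return 0;
    }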

Canonical Shape of for Loop Control Clause (figure)

Execution Context
- Every thread has its own execution context
- Execution context: address space containing all of the variables a thread may access
- Contents of execution context:
  - static variables
  - dynamically allocated data structures in the heap
  - variables on the run-time stack
  - additional run-time stack for functions invoked by the thread

Shared and Private Variables
- Shared variable: has the same address in the execution context of every thread
- Private variable: has a different address in the execution context of every thread
- A thread cannot access the private variables of another thread

Shared and Private Variables (figure)

Function omp_get_num_procs
- Returns the number of physical processors available for use by the parallel program
      int omp_get_num_procs (void)

Function omp_set_num_threads
- Uses the parameter value to set the number of threads to be active in parallel sections of code
- May be called at multiple points in a program
      void omp_set_num_threads (int t)

Pop Quiz
- Write a C program segment that sets the number of threads equal to the number of processors that are available.
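One possible answer to the pop quiz, combining omp_get_num_procs and omp_set_num_threads; this is a sketch, not the slides' own solution.

    #include <omp.h>

    int main (void)
    {
       /* Set the number of threads equal to the number of available processors */
       omp_set_num_threads (omp_get_num_procs ());
       return 0;
    }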

Declaring Private Variables
      for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
         for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k] + tmp);
- Either loop could be executed in parallel
- We prefer to make the outer loop parallel, to reduce the number of forks/joins
- We then must give each thread its own private copy of variable j

private Clause
- Clause: an optional, additional component to a pragma
- private clause: directs the compiler to make one or more variables private
      private ( <variable list> )

Example Use of private Clause
      #pragma omp parallel for private(j)
      for (i = 0; i < BLOCK_SIZE(id,p,n); i++)
         for (j = 0; j < n; j++)
            a[i][j] = MIN(a[i][j], a[i][k] + tmp);

firstprivate Clause
- Used to create private variables having initial values identical to the variable controlled by the master thread as the loop is entered
- Variables are initialized once per thread, not once per loop iteration
- If a thread modifies a variable's value in an iteration, subsequent iterations will get the modified value

lastprivate Clause
- Sequentially last iteration: the iteration that occurs last when the loop is executed sequentially
- lastprivate clause: used to copy back to the master thread's copy of a variable the private copy of the variable from the thread that executed the sequentially last iteration
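A small sketch (not from the original slides) showing firstprivate and lastprivate together; the variable names base and last are illustrative only.

    #include <stdio.h>

    int main (void)
    {
       int i, n = 8;
       int base = 100;
       int last = 0;

       /* firstprivate: each thread's private copy of base starts at 100.
          lastprivate: after the loop, last holds the value assigned in the
          sequentially last iteration (i == n-1). */
       #pragma omp parallel for firstprivate(base) lastprivate(last)
       for (i = 0; i < n; i++)
          last = base + i;

       printf ("last = %d\n", last);   /* prints 107 */
       return 0;
    }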

Critical Sections
      double area, pi, x;
      int i, n;
      ...
      area = 0.0;
      for (i = 0; i < n; i++) {
         x = (i + 0.5)/n;
         area += 4.0/(1.0 + x*x);
      }
      pi = area / n;

Race Condition
- Consider this C program segment to compute pi using the rectangle rule:
      double area, pi, x;
      int i, n;
      ...
      area = 0.0;
      for (i = 0; i < n; i++) {
         x = (i + 0.5)/n;
         area += 4.0/(1.0 + x*x);
      }
      pi = area / n;

Race Condition (cont.)
- If we simply parallelize the loop...
      double area, pi, x;
      int i, n;
      ...
      area = 0.0;
      #pragma omp parallel for private(x)
      for (i = 0; i < n; i++) {
         x = (i + 0.5)/n;
         area += 4.0/(1.0 + x*x);
      }
      pi = area / n;

Race Condition (cont.)
- ... we set up a race condition in which one process may "race ahead" of another and not see its change to shared variable area
- (Figure: both threads read area = 11.667 before executing area += 4.0/(1.0 + x*x); one stores 15.432 and the other then stores 15.230, so the first update is lost. The answer should be 18.995.)

Race Condition Time Line (figure)

critical Pragma
- Critical section: a portion of code that only one thread at a time may execute
- We denote a critical section by putting the pragma
      #pragma omp critical
  in front of a block of C code

Correct, But Inefficient, Code
      double area, pi, x;
      int i, n;
      ...
      area = 0.0;
      #pragma omp parallel for private(x)
      for (i = 0; i < n; i++) {
         x = (i + 0.5)/n;
         #pragma omp critical
         area += 4.0/(1.0 + x*x);
      }
      pi = area / n;

Source of Inefficiency
- Update to area is inside a critical section
- Only one thread at a time may execute the statement; i.e., it is sequential code
- Time to execute the statement is a significant part of the loop
- By Amdahl's Law we know speedup will be severely constrained

Reductions
- Reductions are so common that OpenMP provides support for them
- May add a reduction clause to the parallel for pragma
- Specify the reduction operation and reduction variable
- OpenMP takes care of storing partial results in private variables and combining partial results after the loop

reduction Clause
- The reduction clause has this syntax:
      reduction ( <op> : <variable> )
- Operators
  - +   sum
  - *   product
  - &   bitwise and
  - |   bitwise or
  - ^   bitwise exclusive or
  - &&  logical and
  - ||  logical or
- Example:
      double area, pi, x;
      int i, n;
      ...
      area = 0.0;
      #pragma omp parallel for private(x) reduction(+:area)
      for (i = 0; i < n; i++) {
         x = (i + 0.5)/n;
         area += 4.0/(1.0 + x*x);
      }
      pi = area / n;
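As a sketch of what the reduction clause is doing on our behalf (not from the original slides), here is the same computation written by hand, with per-thread partial sums combined in a critical section after the loop.

    #include <stdio.h>

    int main (void)
    {
       double area = 0.0, pi, x;
       int i, n = 1000000;

       #pragma omp parallel private(x)
       {
          double partial = 0.0;            /* per-thread partial result */

          #pragma omp for
          for (i = 0; i < n; i++) {
             x = (i + 0.5)/n;
             partial += 4.0/(1.0 + x*x);
          }

          #pragma omp critical
          area += partial;                 /* combine once per thread */
       }
       pi = area / n;

       printf ("pi is approximately %.8f\n", pi);
       return 0;
    }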

Performance Improvement #1
- Too many fork/joins can lower performance
- Inverting loops may help performance if
  - parallelism is in the inner loop
  - after inversion, the outer loop can be made parallel
  - inversion does not significantly lower the cache hit rate

Performance Improvement #2
- If a loop has too few iterations, fork/join overhead is greater than the time saved by parallel execution
- The if clause instructs the compiler to insert code that determines at run-time whether the loop should be executed in parallel; e.g.,
      #pragma omp parallel for if(n > 5000)
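A short sketch (not from the original slides) showing the if clause in a complete program; the threshold 5000 simply follows the example above, and the small n makes the loop run sequentially.

    #include <stdio.h>

    int main (void)
    {
       double sum = 0.0;
       int i, n = 100;   /* small n: the if test fails */

       /* With n = 100 no threads are forked, so the fork/join overhead is
          avoided; raise n above 5000 and the loop runs in parallel. */
       #pragma omp parallel for if(n > 5000) reduction(+:sum)
       for (i = 0; i < n; i++)
          sum += i;

       printf ("sum = %f\n", sum);   /* 0+1+...+99 = 4950.0 */
       return 0;
    }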

Performance Improvement #3
- We can use the schedule clause to specify how iterations of a loop should be allocated to threads
- Static schedule: all iterations allocated to threads before any iterations are executed
- Dynamic schedule: only some iterations allocated to threads at the beginning of the loop's execution; remaining iterations allocated to threads that complete their assigned iterations

Static vs. Dynamic Scheduling
- Static scheduling
  - Low overhead
  - May exhibit high workload imbalance
- Dynamic scheduling
  - Higher overhead
  - Can reduce workload imbalance

Chunks
- A chunk is a contiguous range of iterations
- Increasing chunk size reduces overhead and may increase cache hit rate
- Decreasing chunk size allows finer balancing of workloads

schedule Clause
- Syntax of the schedule clause:
      schedule ( <type> [, <chunk> ] )
- Schedule type required, chunk size optional
- Allowable schedule types
  - static: static allocation
  - dynamic: dynamic allocation
  - guided: guided self-scheduling
  - runtime: type chosen at run-time based on the value of environment variable OMP_SCHEDULE

Scheduling Options
- schedule(static): block allocation of about n/t contiguous iterations to each thread
- schedule(static,C): interleaved allocation of chunks of size C to threads
- schedule(dynamic): dynamic one-at-a-time allocation of iterations to threads
- schedule(dynamic,C): dynamic allocation of C iterations at a time to threads

Scheduling Options (cont.)
- schedule(guided,C): dynamic allocation of chunks to tasks using the guided self-scheduling heuristic; initial chunks are bigger, later chunks are smaller, minimum chunk size is C
- schedule(guided): guided self-scheduling with minimum chunk size 1
- schedule(runtime): schedule chosen at run-time based on the value of OMP_SCHEDULE; Unix example:
      setenv OMP_SCHEDULE "static,1"
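A brief sketch (not from the original slides) of the schedule clause in use; dynamic scheduling with chunk size 4 is an arbitrary choice, and work() is a placeholder for an uneven per-iteration workload.

    #include <stdio.h>

    #define N 1000

    double work (int i) { return (double) i * i; }   /* placeholder workload */

    int main (void)
    {
       double r[N];
       int i;

       #pragma omp parallel for schedule(dynamic,4)
       for (i = 0; i < N; i++)
          r[i] = work (i);        /* idle threads grab the next chunk of 4 iterations */

       printf ("r[N-1] = %f\n", r[N-1]);   /* 999*999 = 998001.0 */
       return 0;
    }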

More General Data Parallelism
- Our focus has been on the parallelization of for loops
- Other opportunities for data parallelism
  - processing items on a "to do" list
  - for loop + additional code outside of loop

Processing a "To Do" List (figure)

Sequential Code (1/2)
      int main (int argc, char *argv[])
      {
         struct job_struct *job_ptr;
         struct task_struct *task_ptr;
         ...
         task_ptr = get_next_task (&job_ptr);
         while (task_ptr != NULL) {
            complete_task (task_ptr);
            task_ptr = get_next_task (&job_ptr);
         }
         ...
      }

Sequential Code (2/2)
      char *get_next_task (struct job_struct **job_ptr)
      {
         struct task_struct *answer;

         if (*job_ptr == NULL) answer = NULL;
         else {
            answer = (*job_ptr)->task;
            *job_ptr = (*job_ptr)->next;
         }
         return answer;
      }

Parallelization Strategy
- Every thread should repeatedly take the next task from the list and complete it, until there are no more tasks
- We must ensure no two threads take the same task from the list; i.e., we must declare a critical section

parallel Pragma
- The parallel pragma precedes a block of code that should be executed by all of the threads
- Note: execution is replicated among all threads

Use of parallel Pragma
      #pragma omp parallel private(task_ptr)
      {
         task_ptr = get_next_task (&job_ptr);
         while (task_ptr != NULL) {
            complete_task (task_ptr);
            task_ptr = get_next_task (&job_ptr);
         }
      }

Critical Section for get_next_task
      char *get_next_task (struct job_struct **job_ptr)
      {
         struct task_struct *answer;

         #pragma omp critical
         {
            if (*job_ptr == NULL) answer = NULL;
            else {
               answer = (*job_ptr)->task;
               *job_ptr = (*job_ptr)->next;
            }
         }
         return answer;
      }

Functions for SPMD-style Programming
- The parallel pragma allows us to write SPMD-style programs
- In these programs we often need to know the number of threads and the thread ID number
- OpenMP provides functions to retrieve this information

Function omp_get_thread_num
- This function returns the thread identification number
- If there are t threads, the ID numbers range from 0 to t-1
- The master thread has ID number 0
      int omp_get_thread_num (void)

Function omp_get_num_threads
- Function omp_get_num_threads returns the number of active threads
- If called from a sequential portion of the program, it returns 1
      int omp_get_num_threads (void)
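A small SPMD-style sketch (not from the original slides) using the two functions above.

    #include <omp.h>
    #include <stdio.h>

    int main (void)
    {
       omp_set_num_threads (omp_get_num_procs ());

       #pragma omp parallel
       {
          int id = omp_get_thread_num ();    /* 0 .. t-1 */
          int t  = omp_get_num_threads ();   /* t inside the parallel region */
          printf ("Thread %d of %d active\n", id, t);
       }

       /* Outside any parallel region only the master thread is active */
       printf ("Threads now: %d\n", omp_get_num_threads ());   /* prints 1 */
       return 0;
    }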

for Pragma
- The parallel pragma instructs every thread to execute all of the code inside the block
- If we encounter a for loop that we want to divide among threads, we use the for pragma:
      #pragma omp for

Example Use of for Pragma
      #pragma omp parallel private(i,j)
      for (i = 0; i < m; i++) {
         low = a[i];
         high = b[i];
         if (low > high) {
            printf ("Exiting (%d)\n", i);
            break;
         }
         #pragma omp for
         for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
      }

single Pragma
- Suppose we only want to see the output once
- The single pragma directs the compiler that only a single thread should execute the block of code the pragma precedes
- Syntax:
      #pragma omp single

Use of single Pragma
      #pragma omp parallel private(i,j)
      for (i = 0; i < m; i++) {
         low = a[i];
         high = b[i];
         if (low > high) {
            #pragma omp single
            printf ("Exiting (%d)\n", i);
            break;
         }
         #pragma omp for
         for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
      }

nowait Clause
- The compiler puts a barrier synchronization at the end of every parallel for statement
- In our example this is necessary: if a thread leaves the loop and changes low or high, it may affect the behavior of another thread
- If we make these private variables, then it would be okay to let threads move ahead, which could reduce execution time

Use of nowait Clause
      #pragma omp parallel private(i,j,low,high)
      for (i = 0; i < m; i++) {
         low = a[i];
         high = b[i];
         if (low > high) {
            #pragma omp single
            printf ("Exiting (%d)\n", i);
            break;
         }
         #pragma omp for nowait
         for (j = low; j < high; j++)
            c[j] = (c[j] - a[i])/b[i];
      }

Functional Parallelism
- To this point all of our focus has been on exploiting data parallelism
- OpenMP also allows us to assign different threads to different portions of code (functional parallelism)

Functional Parallelism Example
      v = alpha();
      w = beta();
      x = gamma(v, w);
      y = delta();
      printf ("%6.2f\n", epsilon(x,y));
- May execute alpha, beta, and delta in parallel

parallel sections Pragma
- Precedes a block of k blocks of code that may be executed in parallel
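The extract breaks off here. As an illustration only (not taken from the slides), the functional-parallelism example above could be written with the parallel sections pragma roughly as follows; the function bodies are placeholder stubs added to make the sketch self-contained.

    #include <stdio.h>

    /* Placeholder stubs for illustration only */
    double alpha (void) { return 1.0; }
    double beta  (void) { return 2.0; }
    double delta (void) { return 3.0; }
    double gamma_fn (double v, double w) { return v + w; }  /* "gamma" renamed to avoid the math-library name */
    double epsilon  (double x, double y) { return x * y; }

    int main (void)
    {
       double v, w, x, y;

       /* alpha, beta, and delta are independent, so their sections may run
          concurrently on different threads */
       #pragma omp parallel sections
       {
          #pragma omp section
          v = alpha ();
          #pragma omp section
          w = beta ();
          #pragma omp section
          y = delta ();
       }

       x = gamma_fn (v, w);
       printf ("%6.2f\n", epsilon (x, y));   /* (1.0 + 2.0) * 3.0 = 9.00 */
       return 0;
    }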
