High-Performance Computing: Exercises and Detailed Answers

Uploaded by: w*** · Upload date: 2022-07-31 · Format: DOCX · Pages: 18


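High-Performance Computing Practice Questions

1. Which programming approaches suit parallelism within a single machine, and which suit parallelism across machines?
Single machine: threading, OpenMP. Multiple machines: MPI.

2. Example: peak compute capability of an HPC cluster. Consider a cluster of 256 compute nodes, each with two X5670 processors. The X5670 is a 2.93 GHz Intel Westmere six-core processor; current mainstream Intel processors deliver 4 double-precision floating-point operations per clock cycle.
Peak performance: 2.93 GHz × 4 Flops/Hz × 6 cores × 2 CPUs × 256 nodes = 36003.84 GFlops. One GFlops is one billion (10^9) floating-point operations per second, so 36003.84 GFlops ≈ 36.004 TFlops, a peak of about 36 trillion operations per second.

3. What is the Top500 ranking based on?
High Performance Linpack (HPL) test results.

4. What is currently the most popular GPU development environment?
CUDA.

5. To build an HPC cluster with 200 TFlops of peak performance from dual-socket nodes with 2.93 GHz Intel Westmere six-core X5670 processors, how many compute nodes are needed?
Number of compute nodes = 200 TFlops / (2 × 2.93 GHz × 6 × 4 Flops/Hz) = 200000 GFlops / 140.64 GFlops ≈ 1422.07, i.e. about 1422 nodes (1423 if the peak must reach the full 200 TFlops).

6. What measured speed did Tianhe-1A achieve for its TOP500 ranking, and at what efficiency?
2.57 PFlops, 55%.

7. How is RDMA implemented?
With RDMA (Remote Direct Memory Access), data being sent or received is not copied into intermediate buffers; it is delivered directly to the remote side. This bypasses the kernel and achieves zero-copy transfers.

8. What is the lowest communication latency of InfiniBand?
1-1.3 usec MPI end-to-end; 0.9-1 us InfiniBand latency for RDMA operations.

9. How does GPU-Direct speed up applications?
By eliminating the memory copies between InfiniBand and the GPU.
- GPUs provide a cost-effective way of building supercomputers.
- Dense packaging of compute flops with high memory bandwidth.

10. Which characteristic of the network equipment determines MPI_Allreduce performance?
Cluster size: the time for MPI_Allreduce keeps increasing as cluster size scales, so the scale of the cluster determines MPI_Allreduce performance.

11. How fast is the supercomputer currently ranked first in the world?
K computer: 10 PFlops (ten thousand trillion operations per second), 93% efficiency.

12. Which of the following count as embedded devices?
A. Router  B. Robot  C. Microwave oven  D. Laptop

13. The top two factors in choosing an embedded operating system are:
A. Cost  B. After-sales service  C. Availability of source code  D. Related community  E. Development tools

14. The main challenges in building embedded Linux are:
A. The broad knowledge required  B. The complexity of deep customization  C. Steadily growing maintenance cost  D. Stability and security  E. Open-source projects are usually of low quality

15. The main goals of the Yocto Project are:
A. Build a unified embedded Linux community
B. Provide high-quality tools that make building embedded Linux easy, so that you can focus on the work built on top of it
C. Include a set of tested metadata that guides the cross-compilation of the most essential open-source projects
D. Provide flexible extension interfaces for importing new projects or new board support packages (BSPs)

16. Describe the steps needed to cross-compile an open-source project.
Patch - Configure - Compile - Install - Sysroot - Package - Do_rootfs.

Write code to create a thread that computes the sum of the elements of an array.
Answer:

    #include <pthread.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct arguments {
        double *array;
        int size;
        double *sum;
    };

    void *do_work(void *arg);

    int main(int argc, char **argv) {
        double array[100];
        double sum;
        pthread_t worker_thread;
        struct arguments *arg;

        arg = (struct arguments *)calloc(1, sizeof(struct arguments));
        arg->array = array;
        arg->size = 100;
        arg->sum = &sum;

        /* pthread_create returns non-zero on failure */
        if (pthread_create(&worker_thread, NULL, do_work, (void *)arg)) {
            fprintf(stderr, "Error while creating thread\n");
            exit(1);
        }
        /* ... */
    }

    void *do_work(void *arg) {
        struct arguments *argument;
        int i, size;
        double *array;
        double *sum;

        argument = (struct arguments *)arg;
        size = argument->size;
        array = argument->array;
        sum = argument->sum;

        *sum = 0;
        for (i = 0; i < size; i++)
            *sum += array[i];

        return NULL;
    }

Example: steps in parallelizing the sum s = f(A[1]) + ... + f(A[n]) on p threads (the opening of this example, up to "computing sum s", is lost in the source).
Assignment:
- thread k sums $s_k = f(A[k \cdot n/p]) + \dots + f(A[(k+1) \cdot n/p - 1])$
- thread 1 sums $s = s_1 + \dots + s_p$ (for simplicity of this example)
- thread 1 communicates s to the other threads
Orchestration:
- starting up threads
- communicating, synchronizing with thread 1
Mapping:
- processor j runs thread j

MFlops: millions of floating-point operations per second.
POSIX: Portable Operating System Interface of Unix.
Thread: a sequence of code that can be scheduled as an independent unit (compare: process).

Points to watch when writing multithreaded code: (1) load balancing; (2) correct access to shared variables (through mutual-exclusion code or mutex locks).

35. User-level threads: many-to-one mapping. They need no operating-system support and their operations are cheap, but when one thread blocks, the other threads block as well. Kernel-level threads: one-to-one mapping. Each kernel thread is scheduled independently and the OS performs the thread operations. Kernel threads can run in parallel on the available processors, and when one thread blocks the others can still be scheduled. The cost is higher scheduling overhead, and the OS has to adapt to changes in the number of threads.

Multithreading (Pthreads):
pthread_t: the type of a thread
pthread_create(): creates a thread
pthread_mutex_t: the type of a mutex lock
pthread_mutex_lock(): locks a mutex
pthread_self(): returns the thread identifier of the calling thread

    int pthread_create(pthread_t *thread, const pthread_attr_t *attr,
                       void *(*start_routine)(void *), void *arg);

(1) Computing the sum of the elements of an array: the listing is the same program as in the answer above (struct arguments, main calling pthread_create, and do_work accumulating into *sum).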
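The assignment step above (thread k sums its own block of the array) combined with the note about protecting shared variables with a mutex can be sketched as follows. This is a minimal illustration under stated assumptions, not the course's reference solution: the names `struct slice`, `sum_slice` and `parallel_sum` are hypothetical, and each thread accumulates locally and takes the lock only once to add its partial sum to the shared total.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* Hypothetical per-thread work description: one contiguous block of the array. */
struct slice {
    const double *array;
    int begin, end;            /* half-open range [begin, end) */
    double *total;             /* shared accumulator */
    pthread_mutex_t *lock;     /* protects *total */
};

static void *sum_slice(void *arg)
{
    struct slice *s = arg;
    double local = 0.0;
    for (int i = s->begin; i < s->end; i++)
        local += s->array[i];          /* no locking needed: local is private */

    pthread_mutex_lock(s->lock);       /* shared variable: one locked update per thread */
    *s->total += local;
    pthread_mutex_unlock(s->lock);
    return NULL;
}

/* Sums array[0..n-1] using p threads; thread k handles [k*n/p, (k+1)*n/p). */
double parallel_sum(const double *array, int n, int p)
{
    assert(array != NULL && n >= 0 && p > 0);

    double total = 0.0;
    pthread_mutex_t lock;
    pthread_mutex_init(&lock, NULL);

    pthread_t *threads = malloc((size_t)p * sizeof *threads);
    struct slice *slices = malloc((size_t)p * sizeof *slices);
    assert(threads != NULL && slices != NULL);

    for (int k = 0; k < p; k++) {
        slices[k].array = array;
        slices[k].begin = (int)((long)k * n / p);
        slices[k].end = (int)((long)(k + 1) * n / p);
        slices[k].total = &total;
        slices[k].lock = &lock;
        int rc = pthread_create(&threads[k], NULL, sum_slice, &slices[k]);
        assert(rc == 0);
        (void)rc;
    }
    for (int k = 0; k < p; k++)
        pthread_join(threads[k], NULL);   /* wait before reading total */

    pthread_mutex_destroy(&lock);
    free(slices);
    free(threads);
    return total;
}
```

Calling `parallel_sum(a, 100, 4)` splits a 100-element array into four blocks of 25. Accumulating into a private `local` first keeps lock contention to one acquisition per thread, which is one common way to address both concerns (load balance and safe shared access) named in the notes.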

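Waiting for the worker thread to finish with pthread_join (the start of this listing is missing in the source):

        arg->array = array;
        arg->size = 100;
        arg->sum = &sum;

        if (pthread_create(&worker_thread, NULL, do_work, (void *)arg)) {
            fprintf(stderr, "Error while creating thread\n");
            exit(1);
        }
        ...
        if (pthread_join(worker_thread, &return_value)) {
            fprintf(stderr, "Error while waiting for thread\n");
            exit(1);
        }

RDMA (Remote Direct Memory Access) is implemented through zero-copy and kernel-bypass techniques.

37. What is the lowest communication latency of InfiniBand?
InfiniBand - Highest Performance
- Highest throughput
  - 40 Gb/s node to node and 120 Gb/s switch to switch
  - Nearly 90M MPI messages per second
  - Send/receive and RDMA operations with zero-copy
- Lowest latency
  - 1-1.3 usec MPI end-to-end
  - 0.9-1 us InfiniBand latency for RDMA operations
- Lowest CPU overhead
  - Full transport offload maximizes CPU availability for user applications

38. Computational science, together with theoretical science and experimental science, forms the three pillars by which humanity understands nature.

Application areas: the US HPCC program, which includes magnetic recording technology, new drug design, high-speed civil aviation, catalysis, fuel combustion, ocean modeling, ozone depletion, digital analysis, air pollution, protein structure design, image understanding, and cryptanalysis.

HPC units of measure: Gflop/s (10^9, one billion operations per second), Tflop/s (10^12, one trillion), Pflop/s (10^15, one thousand trillion).

Linpack is the most widely used international benchmark for measuring the floating-point performance of high-performance computer systems. It evaluates floating-point performance by having the machine solve a dense system of N linear equations in N unknowns by Gaussian elimination.

Shared-memory symmetric multiprocessor (SMP): any processor can directly access any memory address, and access latency, bandwidth, and probability are equivalent for all processors; the system is symmetric.

Cluster: multiple computer systems connected by a network so that they provide service as if they were a single system. A cluster offers high parallel processing capability, high availability, load balancing, and convenient management. Clusters are the natural outcome of technical progress in high-performance processors, high-speed networks, cluster operating systems and management systems, and parallel/distributed computing tools and software.

Parallel computing: a single system in which many cores work on the same task. Distributed computing: multiple systems loosely combined by a scheduler to process related tasks. Grid computing: multiple systems and processors tightly coupled by software and networks to process the same task or related tasks together.

The two main advantages of parallel computing: greater aggregate processor performance and larger aggregate memory.

Classification of parallel computers:
1) Shared memory, subdivided into Uniform Memory Access (UMA), where all processors see the same memory access cost, and Non-Uniform Memory Access (NUMA), where access latency depends on where the data is stored.
2) Distributed memory, subdivided into Massively Parallel Processors (MPP) and clusters.

In an SMP, the processors are connected to global memory through a bus or a crossbar switch. Advantage: a simple programming model. Problems: bus bandwidth saturates, and the size of a crossbar grows with the number of processors. Drawback: poor scalability, which limits the size of SMP systems.

Cluster advantages: general-purpose high performance, high availability, high scalability, and a high price/performance ratio.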
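The peak-performance figures worked out in the exercises of these notes all follow one product: clock rate × flops per cycle × cores per CPU × CPUs per node × number of nodes. The C sketch below reproduces that arithmetic; the function names `peak_gflops` and `nodes_needed` are hypothetical, and `nodes_needed` rounds up on the assumption that the cluster has to reach at least the target figure.

```c
#include <assert.h>

/* Theoretical peak in GFlops: GHz * flops/cycle * cores * sockets * nodes. */
double peak_gflops(double ghz, int flops_per_cycle, int cores_per_cpu,
                   int cpus_per_node, int nodes)
{
    assert(ghz > 0.0 && flops_per_cycle > 0 && cores_per_cpu > 0 &&
           cpus_per_node > 0 && nodes >= 0);
    return ghz * flops_per_cycle * cores_per_cpu * cpus_per_node * nodes;
}

/* Smallest node count whose combined peak is at least target_gflops. */
long nodes_needed(double target_gflops, double ghz, int flops_per_cycle,
                  int cores_per_cpu, int cpus_per_node)
{
    double per_node = peak_gflops(ghz, flops_per_cycle, cores_per_cpu,
                                  cpus_per_node, 1);
    double exact = target_gflops / per_node;
    long n = (long)exact;          /* truncate toward zero */
    if ((double)n < exact)
        n++;                       /* round up without needing libm */
    return n;
}
```

For the 256-node dual-socket 2.93 GHz six-core configuration this gives a little over 36003.8 GFlops. For a 200 TFlops target the exact quotient is about 1422.07 nodes, so rounding up gives 1423, while rounding to the nearest node gives the 1422 quoted in the exercise.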

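The notes describe Linpack as solving a dense linear system by Gaussian elimination. A minimal sketch of that kernel is shown below, purely as an illustration: it is not the HPL code, the function name `gauss_solve` is hypothetical, and it assumes a row-major n×n matrix that it overwrites while using partial pivoting.

```c
#include <assert.h>

/* Solves A x = b for a dense n x n row-major matrix A (overwritten, as is b).
   Returns 0 on success, -1 if a zero pivot shows the matrix is singular. */
int gauss_solve(int n, double *a, double *b, double *x)
{
    assert(n > 0 && a && b && x);

    for (int k = 0; k < n; k++) {
        /* Partial pivoting: pick the row with the largest |a[i][k]|, i >= k. */
        int piv = k;
        double best = a[k * n + k] < 0 ? -a[k * n + k] : a[k * n + k];
        for (int i = k + 1; i < n; i++) {
            double v = a[i * n + k] < 0 ? -a[i * n + k] : a[i * n + k];
            if (v > best) { best = v; piv = i; }
        }
        if (best == 0.0)
            return -1;
        if (piv != k) {
            for (int j = 0; j < n; j++) {
                double tmp = a[k * n + j];
                a[k * n + j] = a[piv * n + j];
                a[piv * n + j] = tmp;
            }
            double tb = b[k]; b[k] = b[piv]; b[piv] = tb;
        }
        /* Eliminate column k from the rows below the pivot row. */
        for (int i = k + 1; i < n; i++) {
            double f = a[i * n + k] / a[k * n + k];
            for (int j = k; j < n; j++)
                a[i * n + j] -= f * a[k * n + j];
            b[i] -= f * b[k];
        }
    }
    /* Back substitution on the resulting upper-triangular system. */
    for (int i = n - 1; i >= 0; i--) {
        double s = b[i];
        for (int j = i + 1; j < n; j++)
            s -= a[i * n + j] * x[j];
        x[i] = s / a[i * n + i];
    }
    return 0;
}
```

The triple loop in the elimination phase is where the roughly (2/3)N^3 floating-point operations come from, which is why the benchmark stresses floating-point throughput.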
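Distributed-memory programming model: MPI. Shared-memory programming models: OpenMP, Pthreads. Levels of parallelism, from coarse to fine grain: PVM/MPI, threads, compilers, CPU.

Message passing is currently a very important style of parallel programming. MPI is a library, not a language. MPI is a message-passing programming model whose goal is to provide a practical, portable, efficient, and flexible message-passing interface standard.

Message transfer modes. Blocking: the following statements can execute only after the message has left the local side. Non-blocking: the following statements can execute without waiting for the message to leave the local side, but the resources (buffers) are not guaranteed to be reusable yet.

The "barrel" principle of parallel speedup (Amdahl's law): the parallel speedup of a given problem is limited by the serial part of that problem.

For parallel computing, the most dangerous pitfall is turning a computation problem into a communication problem. This typically happens when the time the nodes spend transferring data to stay synchronized exceeds the time the CPUs spend computing. Common networks: InfiniBand, 10GE, GE, Myrinet. GPU; CPU-GPU hybrid architectures.

Second lecture: Jiang Yunhong (蔣運宏)
VMM: Virtual Machine Monitor. Basic properties of a VMM: equivalence, isolation, efficiency. A VMM must be able to control the entire physical platform; CPU control is achieved through "ring deprivileging". Virtualizable instruction sets: privileged instructions and sensitive instructions.

Liu Tong (劉通)
What is supercomputing? Biggest, fastest: it is about size and speed. Supercomputing is used for simulating physical phenomena, data mining, and virtualization. The components of HPC: hardware, software, applications, and people.
Remote DMA: zero-copy, kernel bypass.
TCP/IP networks: overhead and latency.
InfiniBand's high performance shows in: high throughput (40 Gb/s node to node and 120 Gb/s switch to switch, nearly 90M MPI messages per second, send/receive and RDMA operations with zero-copy); low latency (1-1.3 usec MPI end-to-end, 0.9-1 us InfiniBand latency for RDMA operations); and low CPU overhead.
Key factors that affect scalability: the hardware, the software, and the program itself. As system size grows, the share of time spent on communication keeps increasing.
The most used MPI functions: MPI_Wait, MPI_Allreduce, and MPI_Bcast.
InfiniBand provides higher utilization, performance, and scalability.

Wang Jing (王璟)
Basic concepts: parallel computing, high-end parallel computing, high-performance computing, supercomputing.
Why do HPC? Numerical modeling and simulation of scientific and engineering problems, with the requirement that the computation finishes within a reasonable time limit. How are the demands of high-precision computation met? By parallel computing: reduce the time to solve a single problem, increase the problem size that can be solved, and raise throughput (multiple machines running multiple serial programs at once).
High-performance computer: a computer system made up of multiple computing units, with high speed, large storage capacity, and high reliability.
The three pillars of scientific innovation: theoretical analysis, computational simulation, and observation and experiment.
HPC applications: automobile manufacturing, weather forecasting, biopharmaceuticals, aircraft manufacturing, animation rendering, financial computing, oil exploration.
Parallel hardware architecture: a parallel computer is a computer system made up of multiple processing units that communicate and cooperate to solve large, complex problems quickly and efficiently.
Architecture models: a) PVP; b) SMP; c) MPP (Massively Parallel Processor); d) DSM (Distributed Shared Memory); e) Cluster/COW.
Memory access models: multiprocessors (single address space, shared memory): UMA (Uniform Memory Access), NUMA (Non-Uniform Memory Access); multicomputers (multiple address spaces, non-shared memory): NORMA (No-Remote Memory Access).
Programming models: a) implicit parallelism, i.e. parallel programming languages such as Fortran 90 and HPF (1992); b) shared variables, e.g. the POSIX threads model and OpenMP; c) message passing, e.g. MPI (Message Passing Interface) and PVM (Parallel Virtual Machine).

InfiniBand is switch-centric: the switch is the basic building block of InfiniBand. Its point-to-point switched fabric addresses the shared-bus bottleneck, fault tolerance, and scalability. It features low power consumption at the physical layer and out-of-the-box bandwidth connectivity. InfiniBand characteristics: high speed, remote direct memory access, transport offload.
CPU acceleration: GPU. Network acceleration: InfiniBand. Memory acceleration: virtual storage.
GPU (Graphics Processing Unit): a dedicated graphics display device for personal computers, workstations, and game consoles, either a graphics card or integrated on the motherboard. A CPU devotes more of its resources to caches; a GPU devotes more of its resources to data computation and suits applications with predictable computation patterns.

Challenges facing HPC: a) performance per watt, i.e. finding a balance between generality and efficiency; b) higher degrees of parallelism; c) exascale applications of sufficient value; d) fault tolerance; e) when the device revolution it depends on will happen; f) the relationship with emerging applications; g) the high-performance application software industry.

83. Advantages of cluster technology:
- General-purpose high performance: nodes use conventional server platforms with commodity hardware and operating systems, so they are highly adaptable.
- High availability: a high degree of equipment redundancy (CPU, memory, disks, nodes).
- High scalability: built around switching equipment; nodes and storage can be added or removed flexibly.
- Better price/performance: commodity equipment and unified standards.

84. MPI (Message Passing Interface) is the standard specification for message-passing function libraries. MPI is a library, not a language. MPI is a message-passing programming model and has become the representative and de facto standard of that model. MPI denotes a standard or specification rather than any particular implementation of it. Goal: to provide a practical, portable, efficient, and flexible message-passing interface standard. MPI provides bindings for C/C++ and Fortran.

Review outline: basic abbreviations (five HPC-related abbreviations); concurrency; pipeline; RISC; be able to draw illustrations; How to improve performance?; Coding: how do I speed up my code?; a trivial example; load balance; threads (PThread: POSIX Thread). Part I, term definitions: 2) speedup.

Abbreviations:
HPCC: High Performance Computing and Communications
RISC: Reduced Instruction Set Computing
ILP: Instruction Level Parallelism
SMP: Symmetric Multi-Processors
SMT: Simultaneous Multi-Threading
MPP: Massively Parallel Processor
SISD: single instruction, single data
SIMD: single instruction, multiple data
MIMD: multiple instructions, multiple data
MISD: multiple instructions, single data
MSP: Multi-Streaming vector Processor
MIPS: millions of instructions per second
DAGs: Directed Acyclic Graphs
FCFS: First Come First Serve
EASY: Extensible Argonne Scheduling System
CUDA: Compute Unified Device Architecture (a general-purpose parallel computing architecture)

Why parallel computing was proposed: 1. to increase performance and storage capacity; 2. to let users and computers coordinate with each other; 3. to capture the logical structure of a problem; 4. to handle independent physical devices.
The three major issues of parallelism: performance, correctness, and programmability.
Parallelism in time means pipelining; parallelism in space means using multiple processors to execute computations concurrently.

Peak compute capability of an HPC cluster: consider a cluster of 256 compute nodes, each with two X5560 processors. The X5560 is a 2.8 GHz Intel Nehalem quad-core processor; current mainstream processors deliver 4 double-precision floating-point operations per clock cycle.
Peak performance: 2.8 GHz × 4 Flops/Hz × 4 cores × 2 CPUs × 256 nodes = 22937.6 GFlops. One GFlops is one billion operations per second, so 22937.6 GFlops ≈ 22.938 TFlops, a peak of about 22.9 trillion operations per second.
(Peak calculation) What is the peak of an HPC cluster of 128 compute nodes, each with two 2.66 GHz Intel Nehalem quad-core processors?
128 × 2 × 2.66 GHz × 4 cores × 4 (floating-point operations per clock cycle) ≈ 10,895 GFlops.

How do I speed up my code?
Code inlining, common-subexpression elimination, removal of redundant code, identifying loop invariants, replacing array accesses with pointers, loop unrolling, code motion, replacing function calls with variables, replacing multiplication with addition, and using variables directly.

One option to make code faster is basically to deal with the code. Let's look at some examples of what one can do by hand. These techniques were very popular before compilers were any good. Of course, we'll talk about what the compiler can do nowadays.

Technique #1: identify loop constants.
Technique #2: replace array accesses by pointer dereferences (an integer addition per access).
(The code listings for these two techniques are illegible in the source.)

Technique #3: loop unrolling. The loop to be unrolled:

    for (i = 0; i < 100; i++)
        a[i] = i;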

Unrolled version of the Technique #3 loop (loop unrolling reduces the number of comparisons):

    i = 0;
    do {
        a[i] = i; i++;
        a[i] = i; i++;
        a[i] = i; i++;
        a[i] = i; i++;
    } while (i < 100);

Technique #4: code motion.

    sum = 0;
    for (i = 0; i <= fact(n); i++)
        sum += i;

becomes

    sum = 0;
    f = fact(n);
    for (i = 0; i <= f; i++)
        sum += i;

Load balancing (the opening of this part is lost in the source)

... before execution we know how much load is given to each processor, as opposed to some dynamic algorithm that assigns work to processors when they're ready. We'll come back to that idea when we talk about scheduling.

We will look at:
- 2 simple load-balancing algorithms
- application to our 1-D stencil application
- application to the 1-D distributed LU factorization
- discussion of load-balancing for 2-D data distributions

We assume a homogeneous network and heterogeneous compute nodes in this lecture.

Let's consider p processors. Let $t_1, \dots, t_p$ be the cycle times of the processors, i.e. the time to process one elemental unit of computation (work unit) for the application. Let B be the number of (identical) work units. Let $c_1, \dots, c_p$ be the number of work units processed by each processor, so that $c_1 + \dots + c_p = B$.

Perfect load balancing happens if $c_i \times t_i$ is constant, that is, if $c_i = \frac{1/t_i}{\sum_{k=1}^{p} 1/t_k} \times B$. If B is a multiple of $\mathrm{lcm}(t_1, \dots, t_p) \times \sum_{i=1}^{p} 1/t_i$, then we can have perfect load balancing. But in general the formula for $c_i$ does not give an integer solution.

There is a simple algorithm that gives the optimal (integer) allocation of work units to processors in $O(p^2)$:

    // initialize with fractional values rounded down
    for i = 1 to p
        c_i = floor( (1/t_i) / (sum over k of 1/t_k) * B )
    // iteratively increase the c_i values
    while (c_1 + ... + c_p < B)
        find k in {1, ..., p} such that t_k * (c_k + 1) = min over i of t_i * (c_i + 1)
        c_k = c_k + 1

[Figure: example with 3 processors and 10 work units; the figure content is not recoverable.]

Note that the previous algorithm can be modified slightly to record the optimal allocation for B=1, B=2, etc. One can write the result as a list of processors, each list extending the previous one by one processor:
B=1: P1
B=2: P1, P2
B=3: P1, P2, P1
(the lists for B=4 through B=7 are illegible in the source)
etc.
We will see how this comes in handy for some load-balancing problems (e.g., LU factorization).

Stencil application, example with 4 processors: each processor handles many blocks. What if the processors are heterogeneous? Just give rows to processors proportionally to their speed, still in a cyclic pattern. This should have perfect efficiency, but there could of course be the usual problems of rounding off, e.g., what if a processor is 1.0001 times faster than another?

LU factorization, one step (labels recovered from a figure):
- to find the max: Reduction
- independent computation of the scaling factor: Compute
- every update needs the scaling factor and the element from the pivot row: Broadcasts
- independent computations: Compute; this step requires load-balancing

Our original cyclic distribution doesn't work well in a heterogeneous setting. One option: start with a non-cyclic distribution and rebalance at each step (or every k steps), so that at each step all processors have the same amount of work to do. However, redistribution is expensive in terms of communications.

Alternative: just do what we did for the stencil application. Use the distribution obtained with the incremental algorithm we saw before, reversed (the B=10 processor sequence is illegible in the source). This works, but not great.

[Figure: optimal load-balancing for 10 columns, for 7 columns, and for 4 columns.]

Of course, this should be done for blocks of columns, and not individual columns. Also, it should be done in a "motif" that spans some number of columns (B=10 in our example) and is repeated cyclically throughout the matrix. Provided that B is large enough, we get a good approximation of the optimal load balance.

2-D Data Distributions (Sec 6.2)

What we've seen so far works well for 1-D data distributions:
- use the simple algorithm
- use the allocation pattern over blocks in a cyclic fashion
We have seen that a 2-D distribution is what's most appropriate, for instance for matrix multiplication. We use matrix multiplication as our driving example: $C = A \times B$. Let us assume that we have a $p \times q$ processor grid, that $p = q = n$, and that all 3 matrices are distributed identically.

[Figure: 2-D matrix multiplication; the processor grid $P_{i,j}$ holding the matrix blocks $A_{i,j}$.]

We have seen 3 algorithms to do a matrix multiplication (Cannon, Fox, Snyder). They are pretty difficult to generalize for a heterogeneous platform. The outer product is much simpler, and thus easier to adapt and modify. Let's look at the outer-product algorithm on a heterogeneous 2-D distribution.

Outer-product algorithm: proceeds in $k = 1, \dots, n$ steps.
- Horizontal broadcasts: $P_{i,k}$, for all $i = 1, \dots, p$, broadcasts $a_{ik}$ to the processors in its processor row.
- Vertical broadcasts: $P_{k,j}$, for all $j = 1, \dots, q$, broadcasts $b_{kj}$ to the processors in its processor column.
- Independent computations: processor $P_{i,j}$ can update $c_{ij} = c_{ij} + a_{ik} \times b_{kj}$.

Let $t_{i,j}$ be the cycle time of processor $P_{i,j}$. We assign to processor $P_{i,j}$ a rectangle of size $r_i \times c_j$.
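The outer-product formulation described above can be sketched serially: at step k, column k of A and row k of B contribute a rank-1 update to every element of C. In the distributed algorithm each processor would apply this update only to its own rectangle after receiving the broadcast values; the serial C sketch below (function name `outer_product_matmul` is hypothetical, matrices assumed square and row-major) only illustrates the loop order.

```c
#include <assert.h>

/* C = A x B for n x n row-major matrices, written as n rank-1 (outer-product) updates. */
void outer_product_matmul(int n, const double *a, const double *b, double *c)
{
    assert(n > 0 && a && b && c);

    for (int i = 0; i < n * n; i++)
        c[i] = 0.0;

    for (int k = 0; k < n; k++) {             /* step k of the algorithm */
        for (int i = 0; i < n; i++) {
            double aik = a[i * n + k];        /* value "broadcast" along row i */
            for (int j = 0; j < n; j++)       /* b[k][j] is "broadcast" down column j */
                c[i * n + j] += aik * b[k * n + j];
        }
    }
}
```

Because every update within step k is independent of the others, the only inter-processor dependencies in the parallel version are the two broadcasts, which is what makes this formulation easier to adapt to processors of different speeds.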

First, let us note that it is not always possible to achieve perfect load-balancing. There are some theorems that show that it's only possible if the processor grid matrix, with the processor cycle times put in their spots, is of rank 1.

Each processor computes for $r_i \times c_j \times t_{i,j}$ time. Therefore, the total execution time is

$T = \max_{i,j} \{ r_i \times c_j \times t_{i,j} \}$

Load-balancing can be expressed as a constrained minimization problem:

minimize $\max_{i,j} \{ r_i \times c_j \times t_{i,j} \}$ with the constraints $\sum_{i=1}^{p} r_i = n$ and $\sum_{j=1}^{q} c_j = n$.

The load-balancing problem is in fact much more complex. One can place processors in any place of the processor grid. One must look for the optimal given all possible arrange...
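The 2-D problem above has no simple closed form, but the 1-D incremental algorithm from earlier in these notes (round the fractional shares down, then hand out the remaining work units one at a time to the processor that would finish soonest) is short enough to sketch. The C code below is an illustration under assumptions: the function names `allocate_work` and `makespan` are hypothetical, cycle times are doubles, and ties go to the lowest-numbered processor.

```c
#include <assert.h>

/* Distributes B identical work units over p processors with cycle times t[i]
   (time per work unit); writes the integer allocation into c[0..p-1]. */
void allocate_work(int p, const double *t, int B, int *c)
{
    assert(p > 0 && B >= 0 && t && c);

    double inv_sum = 0.0;
    for (int i = 0; i < p; i++) {
        assert(t[i] > 0.0);
        inv_sum += 1.0 / t[i];
    }

    /* Initialize with the fractional shares rounded down. */
    int assigned = 0;
    for (int i = 0; i < p; i++) {
        c[i] = (int)((1.0 / t[i]) / inv_sum * B);   /* truncation = floor for values >= 0 */
        assigned += c[i];
    }

    /* Give each remaining unit to the processor minimizing t_k * (c_k + 1). */
    while (assigned < B) {
        int k = 0;
        for (int i = 1; i < p; i++)
            if (t[i] * (c[i] + 1) < t[k] * (c[k] + 1))
                k = i;
        c[k]++;
        assigned++;
    }
}

/* Execution time of an allocation: the slowest processor determines it. */
double makespan(int p, const double *t, const int *c)
{
    double worst = 0.0;
    for (int i = 0; i < p; i++)
        if (c[i] * t[i] > worst)
            worst = c[i] * t[i];
    return worst;
}
```

With hypothetical cycle times 3, 5, and 8 and B = 10, the rounded-down shares are 5, 3, and 1; the one remaining unit goes to the third processor (8 × 2 = 16 is smaller than 3 × 6 = 18 and 5 × 4 = 20), giving the allocation (5, 3, 2) with a makespan of 16. Each pass of the while loop costs O(p) and at most p units remain after rounding down, which matches the O(p^2) bound stated in the notes.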
