Introduction

This article explores making use of the GPU for general purpose processing from .NET.

Background

Graphics Processing Units (GPUs) are increasingly being used to perform non-graphics work. The world's fastest super computer - the Tianhe-1A - makes use of rather a lot of GPUs. The reason for using GPUs is the massively parallel architecture they provide. Whereas even the top of the range Intel and AMD processors offer six or eight cores, a GPU can have hundreds of cores. Furthermore, GPUs have various types of memory that allow efficient addressing schemes. Depending on the algorithm, this can all give a massive performance increase, and speed-ups of 100x or more are not uncommon or even that complicated to achieve.

This is not just something for super computers though. Even normal PCs can take advantage of GPUs. I'm typing on a fairly cheap Acer laptop ($800) and it has an Intel i5 processor and an NVIDIA GT540M GPU. This little thing hardly runs warm and can give my fairly standard workstation with its two NVIDIA GTX 460s a good run for its money. The amazing thing about this workstation is that in ideal conditions it can do 1 teraflops (less than 100 GFLOPs are down to the Intel i7 CPU). If you look through the history of super computers, this means I've got something that matches the performance of the best computer in the world in 1996 (ASCI Red). 1996 is not that long ago. In 2026, can we have a Tianhe-1A under our desks?

The key point above is that whether an application can speed up or not is down to the algorithm. Not everything will benefit and sometimes you need to be creative. Whether you have the time or budget to do this must be weighed up. Anyway, whatever lowers the hurdle to taking advantage of the supercomputer in your PC or laptop can only be a good thing.

In the world of General Purpose GPU (GPGPU), CUDA from NVIDIA is currently the most user friendly. This is a variety of C. Compiling requires use of the NVIDIA NVCC compiler, which then makes use of the Microsoft Visual C++ compiler. It's not a tough language to learn but it does raise some interesting issues. Applications tend to become first and foremost CUDA applications. To extend a 'normal' application to offload to the GPU needs a different approach and typically the CUDA Driver API is used. You compile modules with the NVCC compiler and load them into your application. This is all very well but it leaves a rather clunky approach. You have two separate code bases. This may not be a big deal if you have not enjoyed Visual Studio and .NET.

Using CUDA from .NET is another story. Currently NVIDIA direct people to CUDA.NET, which is a nice, if rather thin, wrapper around the CUDA API. However, GPU code must still be written and compiled separately with NVCC, and work appears to have stopped on CUDA.NET. The latest changes that came in with CUDA 3.2 and now 4.0 mean that a number of things are broken (e.g. CUDA 3.2 introduced 64-bit pointers and v2 versions of much of the API).

Being a die hard .NET developer, it was time to rectify matters and the result is Cudafy.NET. Cudafy is the unofficial verb used to describe porting CPU code to CUDA GPU code. Cudafy.NET allows you to program the GPU completely from within your .NET application with a minimum of messy, clunky business. Now on with the show.

The Code

This project will get you set up and running with Cudafy.NET. A number of simple routines will be run on the GPU from a standard .NET application. You'll need to do a few things first if you have not already got them. First, be sure you have a relatively recent NVIDIA graphics card, one that supports CUDA. If you don't, then it's not the end of the world since Cudafy supports GPGPU emulation. Emulation is good for debugging but, depending on how many threads you are trying to run in parallel, can be painfully slow. If you have a normal PC, you can pick up an NVIDIA PCI Express CUDA GPU for very little money. Since the applications scale automatically, your app will run on all CUDA GPUs, from the minimal ones in some netbooks up to the dedicated high end Tesla varieties.

You'll then need to go to the NVIDIA CUDA website and download the CUDA 4.0 Toolkit. Install this in the default locations. Next up, ensure that you have the latest NVIDIA drivers. These can be obtained through NVIDIA update or downloaded from NVIDIA. The project here was built using Visual Studio 2010 and the language is C#, though VB and other .NET languages should be fine. Visual Studio Express can be used without problem - bear in mind that only 32-bit apps can be created though. To get this working with Express, you need to visit the Visual Studio Express website:

1. Download and install Visual C++ 2010 Express (the NVIDIA compiler requires this)
2. Download and install Visual C# 2010 Express
3. Download and install CUDA 4.0 Toolkit (32-bit and/or 64-bit)
4. Ensure that the C++ compiler (cl.exe) is on the search path (Environment variables)
5. May need a reboot

This set-up of NVCC is actually the toughest stage of the whole process, so please persevere. Read any errors you get carefully - most likely they are related to not finding cl.exe or not having either the 32-bit or 64-bit CUDA Toolkit.

Finally, to make proper use of Cudafy, a basic understanding of the CUDA architecture is required. There is no real getting around this. It is not the goal of this tutorial to provide this, so I refer you to CUDA by Example by Jason Sanders and Edward Kandrot.

Using the Code

The downloadable code provides a VS2010 C# 4.0 console application including the Cudafy libraries. For more information on the Cudafy.NET SDK, please visit the website. The application does some basic operations on the GPU. A single reference is added to the Cudafy.NET.dll. For translation to CUDA C, it relies on the excellent ILSpy .NET decompiler from SharpDevelop and Mono.Cecil from Jb Evain. The libraries ICSharpCode.Decompiler.dll, ICSharpCode.NRefactory.dll, IlSpy.dll and Mono.Cecil.dll contain this functionality. Cudafy.NET currently relies on very slightly modified versions of the ILSpy 22 libraries.

There are various namespaces to contend with. In Program.cs, we use the following:

    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;

To specify which functions you wish to run on the GPU, you apply the Cudafy attribute. The simplest possible function you can run on your GPU is illustrated by the method kernel, and a slightly more complex and useful one is also shown:

    [Cudafy]
    public static void kernel()
    {
    }

    [Cudafy]
    public static void add(int a, int b, int[] c)
    {
        c[0] = a + b;
    }

These methods can be converted into GPU code from within the same application by use of CudafyTranslator. This is a wrapper around the ILSpy derived CUDA language and simply converts .NET code into CUDA C and encapsulates this, along with reflection information, into a CudafyModule. The CudafyModule has a Compile method that wraps the NVIDIA NVCC compiler. The output of the NVCC compilation is what NVIDIA call PTX. This is a form of intermediate language for the GPU and allows multiple generations of GPU to work with the same applications. This is also stored in the CudafyModule. A CudafyModule can also be serialized and deserialized to/from XML. The default extension of such files is *.cdfy. Such XML files are useful since they enable us to speed things up and avoid unnecessary compilation on future runs. The CudafyModule has methods for checking the checksum - essentially we test to see if the .NET code has changed since the cached XML file was created.
If not, then we can safely assume the deserialized CudafyModule instance and the .NET code are in sync. In this example project, we use the 'smart' Cudafy method on the CudafyTranslator, which does caching and type inference automagically. It does the equivalent of the following steps:

    CudafyModule km = CudafyModule.TryDeserialize(typeof(Program).Name);
    if (km == null || !km.TryVerifyChecksums())
    {
        km = CudafyTranslator.Cudafy(typeof(Program));
        km.Serialize();
    }

In the first line, we attempt to deserialize from an XML file called Program.cdfy. The extension is added automatically if it is not explicitly specified. If the file does not exist or deserialization fails for some reason, then null is returned. In contrast, the Deserialize method throws an exception if it fails. If null is not returned, then we verify the checksums using TryVerifyChecksums. This method returns false if the file was created using an assembly with a different checksum. If either check fails, then we cudafy again. This time, we explicitly pass the type we wish to cudafy. Multiple types can be specified here. Finally, we serialize this for future use.

Now we have a valid module, we can proceed. To load the module, we need first to get a handle to the desired GPU. This is done as follows. GetDevice is overloaded and can include a device id for specifying which GPU to use in systems with more than one, and it returns an abstract GPGPU instance. The eGPUType enumerator can be Cuda or Emulator. Loading the module is done fairly obviously in the next line. CudaGPU and EmulatedGPU derive from GPGPU.

    _gpu = CudafyHost.GetDevice(eGPUType.Cuda);
    _gpu.LoadModule(km);

To run the method kernel we need to use the Launch method on the GPU instance (Launch is a fancy GPU way of saying start or, in .NET parlance, Invoke). There are many overloaded versions of this method. The most straightforward and cleanest to use are those that take advantage of .NET 4.0's dynamic language run-time (DLR).

    _gpu.Launch().kernel();

Launch takes in this case zero arguments, which means one thread is started on the GPU, and it runs the kernel method, which also takes zero arguments. An alternative non-dynamic way of launching is shown below. The advantage is that it is faster the first time round. The DLR can add up to 50ms doing its wizardry. The two arguments of value 1 refer to the number of threads (basically 1 * 1), but more on this later.

    _gpu.Launch(1, 1, "kernel");

There are a number of other examples provided, including the compulsory 'Hello, world' (written in Unicode on a GPU). Of greater interest is the Add vectors code, since working on large data sets is the bread and butter of GPGPU. Our method addVector is defined as:

    [Cudafy]
    public static void addVector(GThread thread, int[] a, int[] b, int[] c)
    {
        // Get the id of the thread. addVector is called N times in parallel, so we need
        // to know which one we are dealing with.
        int tid = thread.blockIdx.x;
        // To prevent reading beyond the end of the array we check that
        // the id is less than Length
        if (tid < a.Length)
            c[tid] = a[tid] + b[tid];
    }

Parameters a and b are the input vectors and c is the resultant vector. GThread is the surprise component. Since the GPU will launch many threads of addVector in parallel, we need to be able to identify within the method which thread we're dealing with. This is achieved through CUDA built-in variables, which in Cudafy are accessible via GThread.

If you have an array a of length N on the host and you want to process it on the GPU, then you need to transfer the data there. We use the CopyToDevice method of the GPU instance.

    int[] a = new int[N];
    int[] dev_a = _gpu.CopyToDevice(a);

What is interesting here is the return value of CopyToDevice. It looks like an array of integers. However, if you were to hover your mouse over it in the debugger, you'd see it has a length of zero, not N. What has been returned is a pointer to the data on the GPU. It is only valid in GPU code (the methods you marked with the Cudafy attribute). The GPU instance stores these pointers. Transferring data to the GPU is all very well, but we may also need memory on the GPU for result or intermediate data. For this, we use the Allocate method. Below is the code to allocate N integers on the GPU.

    int[] dev_c = _gpu.Allocate<int>(N);

Launching the addVector method is more complex and requires arguments to specify how many threads, as well as arguments for the target method itself.

    _gpu.Launch(N, 1).addVector(dev_a, dev_b, dev_c);

Threads are grouped in Blocks. Blocks are grouped in a Grid. Here we launch N Blocks where each block contains 1 thread. Note that addVector contains a GThread arg - there is no need to pass this as an argument. As stated earlier, GThread is the Cudafy equivalent of the built-in CUDA variables and we use it to identify the thread id. As an example of the general case, a grid can contain a 2D array of blocks where each block contains a 2D array of threads.

Another interesting point of note is the FreeAll method on the GPU instance. Memory on a GPU is typically more limited than that of the host, so use it wisely. You need to free memory explicitly; however, if the GPU instance goes out of scope, then its destructor will clear up GPU memory.

The final example is somewhat more complex and illustrates the use of structures and multi-dimensional arrays. In file Struct.cs, ComplexFloat is defined:

    [Cudafy]
    public struct ComplexFloat
    {
        public ComplexFloat(float r, float i)
        {
            Real = r;
            Imag = i;
        }

        public float Real;
        public float Imag;

        public ComplexFloat Add(ComplexFloat c)
        {
            return new ComplexFloat(Real + c.Real, Imag + c.Imag);
        }
    }

The complete structure will be translated. It is not necessary to put attributes on the members. We can freely make use of this structure in both the host and GPU code. In this case, we initialize the 3D array of these on the host and then transfer it to the GPU. On the GPU, we do the following:

    [Cudafy]
    public static void struct3D(GThread thread, ComplexFloat[,,] result)
    {
        int x = thread.blockIdx.x;
        int y = thread.blockIdx.y;
        int z = 0;
        while (z < result.GetLength(2))
        {
            result[x, y, z] = result[x, y, z].Add(result[x, y, z]);
            z++;
        }
    }

The threads are launched this time with an x and y component (we launch a grid of threads equal to the size of the x and y dimensions of the array), and each thread loops over the z dimension.
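To make the index mapping in struct3D concrete, here is a minimal CPU-only sketch of what that kernel computes. It deliberately uses no Cudafy.NET calls, so it runs without a CUDA GPU or the toolkit. The names Struct3DSketch and Struct3DBody, the array sizes and the use of Parallel.For are assumptions made for illustration, not part of the Cudafy API: each parallel iteration stands in for one block, and blockX and blockY play the role of thread.blockIdx.x and thread.blockIdx.y.

```csharp
using System;
using System.Threading.Tasks;

// Mirrors the ComplexFloat struct from the article, minus the [Cudafy]
// attribute, so this sketch compiles without a reference to Cudafy.NET.
public struct ComplexFloat
{
    public ComplexFloat(float r, float i)
    {
        Real = r;
        Imag = i;
    }

    public float Real;
    public float Imag;

    public ComplexFloat Add(ComplexFloat c)
    {
        return new ComplexFloat(Real + c.Real, Imag + c.Imag);
    }
}

public static class Struct3DSketch
{
    // Hypothetical CPU stand-in for one GPU thread: blockX and blockY take
    // the place of thread.blockIdx.x and thread.blockIdx.y.
    static void Struct3DBody(int blockX, int blockY, ComplexFloat[,,] result)
    {
        int z = 0;
        while (z < result.GetLength(2))
        {
            result[blockX, blockY, z] =
                result[blockX, blockY, z].Add(result[blockX, blockY, z]);
            z++;
        }
    }

    public static void Main()
    {
        // Illustrative sizes only; the article does not state the real ones.
        const int XSize = 4, YSize = 3, ZSize = 2;
        var data = new ComplexFloat[XSize, YSize, ZSize];

        for (int x = 0; x < XSize; x++)
            for (int y = 0; y < YSize; y++)
                for (int z = 0; z < ZSize; z++)
                    data[x, y, z] = new ComplexFloat(x, y + z);

        // One logical "block" per (x, y) pair. Each iteration touches only its
        // own column of z values, so the iterations can safely run in parallel.
        Parallel.For(0, XSize * YSize,
            i => Struct3DBody(i / YSize, i % YSize, data));

        // Every element was added to itself, so (3, 3) becomes (6, 6).
        Console.WriteLine("{0} {1}", data[3, 2, 1].Real, data[3, 2, 1].Imag);
    }
}
```

The point of the sketch is the partitioning: because each (x, y) pair owns a disjoint set of elements, no synchronization is needed, which is the same property that lets the GPU run one block per pair.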