A Simple Tutorial on .NET Programming with CUDAfy


Introduction

This article explores making use of the GPU for general purpose processing from .NET.

Background

Graphics Processing Units (GPUs) are increasingly being used to perform non-graphics work. The world's fastest supercomputer, the Tianhe-1A, makes use of rather a lot of GPUs. The reason for using GPUs is the massively parallel architecture they provide. Whereas even the top of the range Intel and AMD processors offer six or eight cores, a GPU can have hundreds of cores. Furthermore, GPUs have various types of memory that allow efficient addressing schemes. Depending on the algorithm, this can all give a massive performance increase, and speed-ups of 100x or more are not uncommon or even that complicated to achieve.

This is not just something for supercomputers, though. Even normal PCs can take advantage of GPUs. I'm typing on a fairly cheap Acer laptop ($800) and it has an Intel i5 processor and an NVIDIA GT540M GPU. This little thing hardly runs warm and can give my fairly standard workstation with its two NVIDIA GTX 460s a good run for its money. The amazing thing about this workstation is that in ideal conditions it can do 1 teraflops (less than 100 GFLOPS of that is down to the Intel i7 CPU). If you look through the history of supercomputers, this means I've got something that matches the performance of the best computer in the world in 1996 (ASCI Red). 1996 is not that long ago. In 2026, can we have a Tianhe-1A under our desks?

The key point above is that whether an application can be sped up or not is down to the algorithm. Not everything will benefit, and sometimes you need to be creative. Whether you have the time or budget to do this must be weighed up. Anyway, whatever lowers the hurdle to taking advantage of the supercomputer in your PC or laptop can only be a good thing. In the world of General Purpose GPU (GPGPU) programming, CUDA from NVIDIA is currently the most user friendly. This is a variety of C. Compiling requires the NVIDIA NVCC compiler, which in turn makes use of the Microsoft Visual C++ compiler. It's not a tough language to learn, but it does raise some interesting issues. Applications tend to become first and foremost CUDA applications. Extending a 'normal' application to offload work to the GPU needs a different approach, and typically the CUDA Driver API is used. You compile modules with the NVCC compiler and load them into your application. This is all very well, but it leaves a rather clunky approach. You have two separate code bases. This may not be a big deal if you have not enjoyed Visual Studio and .NET.

Using CUDA from .NET is another story. Currently NVIDIA direct .NET people to CUDA.NET, which is a nice, if rather thin, wrapper around the CUDA API. However, GPU code must still be written and compiled separately with NVCC, and work appears to have stopped on CUDA.NET. The latest changes that came in with CUDA 3.2 and now 4.0 mean that a number of things are broken (e.g. CUDA 3.2 introduced 64-bit pointers and v2 versions of much of the API).

Being a die-hard .NET developer, it was time to rectify matters, and the result is Cudafy.NET. Cudafy is the unofficial verb used to describe porting CPU code to CUDA GPU code. Cudafy.NET allows you to program the GPU completely from within your .NET application with a minimum of messy, clunky business. Now on with the show.

The Code

This project will get you set up and running with Cudafy.NET. A number of simple routines will be run on the GPU from a standard .NET application. You'll need to do a few things first if you have not already got them. First, be sure you have a relatively recent NVIDIA graphics card, one that supports CUDA. If you don't, then it's not the end of the world, since Cudafy supports GPGPU emulation. Emulation is good for debugging but, depending on how many threads you are trying to run in parallel, can be painfully slow. If you have a normal PC, you can pick up an NVIDIA PCI Express CUDA GPU for very little money. Since the applications scale automatically, your app will run on all CUDA GPUs, from the minimal ones in some netbooks up to the dedicated high-end Tesla varieties.

You'll then need to go to the NVIDIA CUDA website and download the CUDA 4.0 Toolkit. Install this in the default locations. Next up, ensure that you have the latest NVIDIA drivers. These can be obtained through NVIDIA Update or downloaded from NVIDIA. The project here was built using Visual Studio 2010 and the language is C#, though VB and other .NET languages should be fine. Visual Studio Express can be used without problem, though bear in mind that only 32-bit apps can be created. To get this working with Express, you need to visit the Visual Studio Express website:

1. Download and install Visual C++ 2010 Express (the NVIDIA compiler requires this).
2. Download and install Visual C# 2010 Express.
3. Download and install the CUDA 4.0 Toolkit (32-bit and/or 64-bit).
4. Ensure that the C++ compiler (cl.exe) is on the search path (Environment variables).
5. Reboot if necessary.

This set-up of NVCC is actually the toughest stage of the whole process, so please persevere. Read any errors you get carefully. Most likely they are related to not finding cl.exe or to not having the matching 32-bit or 64-bit CUDA Toolkit.

Finally, to make proper use of Cudafy, a basic understanding of the CUDA architecture is required. There is no real getting around this. It is not the goal of this tutorial to provide this, so I refer you to CUDA by Example by Jason Sanders and Edward Kandrot.
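Since a missing cl.exe is named as the most likely set-up error, it can help to check the search path from code before attempting a compile. The following is a minimal sketch in plain C# with no Cudafy dependency; the class and method names (CompilerPathCheck, FindOnPath) are made up for this illustration and are not part of any SDK.

```csharp
using System;
using System.IO;

// Hypothetical helper, not part of Cudafy.NET or the CUDA Toolkit.
public static class CompilerPathCheck
{
    // Returns the full path of the first match for fileName in the given
    // PATH-style string, or null when no listed directory contains it.
    public static string FindOnPath(string fileName, string pathVariable)
    {
        if (string.IsNullOrEmpty(pathVariable)) return null;

        foreach (string dir in pathVariable.Split(Path.PathSeparator))
        {
            // PATH entries are sometimes quoted or padded with spaces.
            string trimmed = dir.Trim().Trim('"');
            if (trimmed.Length == 0) continue;

            string candidate;
            try { candidate = Path.Combine(trimmed, fileName); }
            catch (ArgumentException) { continue; } // skip malformed entries

            if (File.Exists(candidate)) return candidate;
        }
        return null;
    }

    public static void Main()
    {
        string found = FindOnPath("cl.exe", Environment.GetEnvironmentVariable("PATH"));
        Console.WriteLine(found ?? "cl.exe was not found on PATH; NVCC will probably fail.");
    }
}
```

Running this from the same environment that launches your application (for example, from within Visual Studio) shows which cl.exe, if any, would be picked up.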

Using the Code

The downloadable code provides a VS2010 C# 4.0 console application including the Cudafy libraries. For more information on the Cudafy.NET SDK, please visit the Cudafy.NET website. The application does some basic operations on the GPU. A single reference is added to Cudafy.NET.dll. For translation to CUDA C, it relies on the excellent ILSpy .NET decompiler from SharpDevelop and Mono.Cecil from Jb Evain. The libraries ICSharpCode.Decompiler.dll, ICSharpCode.NRefactory.dll, IlSpy.dll and Mono.Cecil.dll contain this functionality. Cudafy.NET currently relies on very slightly modified versions of the ILSpy libraries.

There are various namespaces to contend with. In Program.cs, we use the following:

    using Cudafy;
    using Cudafy.Host;
    using Cudafy.Translator;

To specify which functions you wish to run on the GPU, you apply the Cudafy attribute. The simplest possible function you can run on your GPU is illustrated by the method kernel, and a slightly more complex and useful one is also shown:

    [Cudafy]
    public static void kernel()
    {
    }

    [Cudafy]
    public static void add(int a, int b, int[] c)
    {
        c[0] = a + b;
    }

These methods can be converted into GPU code from within the same application by use of CudafyTranslator. This is a wrapper around the ILSpy-derived CUDA language translator; it converts .NET code into CUDA C and encapsulates this, along with reflection information, in a CudafyModule. The CudafyModule has a Compile method that wraps the NVIDIA NVCC compiler. The output of the NVCC compilation is what NVIDIA call PTX. This is a form of intermediate language for the GPU and allows multiple generations of GPU to work with the same applications. The PTX is also stored in the CudafyModule.

A CudafyModule can also be serialized to and deserialized from XML. The default extension of such files is *.cdfy. These XML files are useful since they enable us to speed things up and avoid unnecessary compilation on future runs. The CudafyModule has methods for checking the checksum: essentially, we test to see if the .NET code has changed since the cached XML file was created.
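The idea behind the checksum test can be pictured with a small CPU-only sketch. This is not how Cudafy.NET computes or stores its checksums (the article does not say); the class name, the choice of SHA-256 and the byte arrays standing in for "the .NET code" are all assumptions made for illustration.

```csharp
using System;
using System.Security.Cryptography;

// Hypothetical illustration of checksum-based cache validation.
public static class ChecksumCacheSketch
{
    // Fingerprint of the content the cache was built from.
    public static string ChecksumOf(byte[] content)
    {
        using (SHA256 sha = SHA256.Create())
            return BitConverter.ToString(sha.ComputeHash(content));
    }

    // The cache is reusable only if it exists and was produced from
    // content identical to what we have now.
    public static bool CacheIsValid(string cachedChecksum, byte[] currentContent)
    {
        return cachedChecksum != null && cachedChecksum == ChecksumOf(currentContent);
    }

    public static void Main()
    {
        byte[] original = { 1, 2, 3 };
        byte[] modified = { 1, 2, 4 };
        string stored = ChecksumOf(original);   // saved alongside the cache

        Console.WriteLine(CacheIsValid(stored, original)); // True: reuse the cache
        Console.WriteLine(CacheIsValid(stored, modified)); // False: rebuild
        Console.WriteLine(CacheIsValid(null, original));   // False: no cache yet
    }
}
```

The trade-off is the usual one for caches: a cheap comparison on every start-up in exchange for skipping translation and NVCC compilation whenever nothing has changed.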

If the code has not changed, we can safely assume the deserialized CudafyModule instance and the .NET code are in sync. In this example project, we use the 'smart' Cudafy method on the CudafyTranslator, which does caching and type inference automagically. It does the equivalent of the following steps:

    CudafyModule km = CudafyModule.TryDeserialize(typeof(Program).Name);
    if (km == null || !km.TryVerifyChecksums())
    {
        km = CudafyTranslator.Cudafy(typeof(Program));
        km.Serialize();
    }

In the first line, we attempt to deserialize from an XML file called Program.cdfy. The extension is added automatically if it is not explicitly specified. If the file does not exist or deserialization fails for some reason, then null is returned. In contrast, the Deserialize method throws an exception if it fails. If null is not returned, then we verify the checksums using TryVerifyChecksums. This method returns false if the file was created using an assembly with a different checksum. If either this or the prior check fails, then we cudafy again. This time, we explicitly pass the type we wish to cudafy. Multiple types can be specified here. Finally, we serialize the module for future use.

Now that we have a valid module, we can proceed. To load the module, we first need to get a handle to the desired GPU. This is done as follows. GetDevice is overloaded and can include a device id for specifying which GPU to use in systems with more than one, and it returns an abstract GPGPU instance. The eGPUType enumerator can be Cuda or Emulator. CudaGPU and EmulatedGPU derive from GPGPU. Loading the module is done fairly obviously in the next line.

    _gpu = CudafyHost.GetDevice(eGPUType.Cuda);
    _gpu.LoadModule(km);

To run the method kernel, we need to use the Launch method on the GPU instance (Launch is a fancy GPU way of saying start or, in .NET parlance, Invoke). There are many overloaded versions of this method. The most straightforward and cleanest to use are those that take advantage of .NET 4.0's dynamic language run-time (DLR).

    _gpu.Launch().kernel();

Launch takes zero arguments in this case, which means one thread is started on the GPU, and it runs the kernel method, which also takes zero arguments. An alternative, non-dynamic way of launching is shown below. Its advantage is that it is faster the first time round: the DLR can add up to 50 ms doing its wizardry. The two arguments of value 1 refer to the number of threads (basically 1 * 1), but more on this later.

    _gpu.Launch(1, 1, "kernel");

There are a number of other examples provided, including the compulsory 'Hello, world' (written in Unicode on a GPU). Of greater interest is the Add vectors code, since working on large data sets is the bread and butter of GPGPU.
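To see why a call such as _gpu.Launch().kernel() can compile even though no kernel member exists on the launcher, and why it ends up equivalent to passing the name as a string, here is a rough sketch of the general DLR mechanism. SketchLauncher is a hypothetical stand-in written for this illustration; it is not Cudafy's class and only records what would be launched.

```csharp
using System;
using System.Collections.Generic;
using System.Dynamic;

// Hypothetical stand-in for a launcher; NOT Cudafy's real implementation.
public sealed class SketchLauncher : DynamicObject
{
    private readonly int _gridSize;
    private readonly int _blockSize;
    private readonly List<string> _calls = new List<string>();

    public SketchLauncher(int gridSize, int blockSize)
    {
        _gridSize = gridSize;
        _blockSize = blockSize;
    }

    // Log of every launch request, so the behaviour can be inspected.
    public List<string> Calls { get { return _calls; } }

    // The DLR calls this when a member that does not exist at compile time
    // is invoked on a 'dynamic' reference; binder.Name is the method name.
    public override bool TryInvokeMember(InvokeMemberBinder binder, object[] args, out object result)
    {
        Launch(_gridSize, _blockSize, binder.Name, args);
        result = null;
        return true;
    }

    // The non-dynamic path: the method name is passed as a string.
    public void Launch(int gridSize, int blockSize, string methodName, params object[] args)
    {
        _calls.Add(string.Format("{0}<<<{1},{2}>>>({3} args)", methodName, gridSize, blockSize, args.Length));
    }
}

public static class DynamicLaunchDemo
{
    public static void Main()
    {
        SketchLauncher launcher = new SketchLauncher(1, 1);
        dynamic d = launcher;

        d.kernel();                       // resolved by name at run time
        launcher.Launch(1, 1, "kernel");  // same request, no DLR involved

        foreach (string call in launcher.Calls)
            Console.WriteLine(call);      // prints "kernel<<<1,1>>>(0 args)" twice
    }
}
```

The run-time binding in the dynamic path is where a first-call delay can come from; the string-based overload skips that step at the cost of losing the method-call syntax.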

Our method addVector is defined as:

    [Cudafy]
    public static void addVector(GThread thread, int[] a, int[] b, int[] c)
    {
        // Get the id of the thread. addVector is called N times in parallel, so we need
        // to know which one we are dealing with.
        int tid = thread.blockIdx.x;
        // To prevent reading beyond the end of the array we check that
        // the id is less than Length
        if (tid < a.Length)
            c[tid] = a[tid] + b[tid];
    }

Parameters a and b are the input vectors and c is the resultant vector. GThread is the surprise component. Since the GPU will launch many threads of addVector in parallel, we need to be able to identify within the method which thread we're dealing with. This is achieved through the CUDA built-in variables, which in Cudafy are accessible via GThread.

If you have an array a of length N on the host and you want to process it on the GPU, then you need to transfer the data there. We use the CopyToDevice method of the GPU instance.

    int[] a = new int[N];
    int[] dev_a = _gpu.CopyToDevice(a);

What is interesting here is the return value of CopyToDevice. It looks like an array of integers. However, if you were to hover your mouse over it in the debugger, you'd see it has a length of zero, not N. What has been returned is a pointer to the data on the GPU. It is only valid in GPU code (the methods you marked with the Cudafy attribute). The GPU instance stores these pointers. Transferring data to the GPU is all very well, but we may also need memory on the GPU for result or intermediate data. For this, we use the Allocate method. Below is the code to allocate N integers on the GPU.

    int[] dev_c = _gpu.Allocate<int>(N);

Launching the addVector method is more complex and requires arguments to specify how many threads to start, as well as arguments for the target method itself.

    _gpu.Launch(N, 1).addVector(dev_a, dev_b, dev_c);

Threads are grouped in Blocks. Blocks are grouped in a Grid. Here we launch N Blocks where each block contains 1 thread. Note that addVector contains a GThread arg, but there is no need to pass this as an argument. As stated earlier, GThread is the Cudafy equivalent of the built-in CUDA variables and we use it to identify the thread id.
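Launch(N, 1) uses one thread per block, so the block index alone identifies the element. With more than one thread per block, the usual CUDA pattern combines the block index, block size and thread index into a global index. The sketch below emulates this sequentially on the CPU in plain C#; ThreadCoords and GridSketch are hypothetical names invented here, standing in for the corresponding GThread fields, and no Cudafy call is involved.

```csharp
using System;

// Hypothetical stand-in for the index values that GThread exposes.
public struct ThreadCoords
{
    public int BlockIdx;   // index of the block within the grid
    public int ThreadIdx;  // index of the thread within its block
    public int BlockDim;   // number of threads per block
}

public static class GridSketch
{
    // Same work as addVector, but with a global index that also works
    // when a block holds more than one thread.
    public static void AddVector(ThreadCoords t, int[] a, int[] b, int[] c)
    {
        int tid = t.ThreadIdx + t.BlockIdx * t.BlockDim;
        if (tid < a.Length)   // surplus threads in the last block do nothing
            c[tid] = a[tid] + b[tid];
    }

    // Sequential emulation of launching gridSize blocks of blockSize threads.
    public static void Launch(int gridSize, int blockSize, int[] a, int[] b, int[] c)
    {
        for (int block = 0; block < gridSize; block++)
            for (int thread = 0; thread < blockSize; thread++)
                AddVector(new ThreadCoords { BlockIdx = block, ThreadIdx = thread, BlockDim = blockSize }, a, b, c);
    }

    // Number of blocks needed to cover n elements (rounds up).
    public static int BlocksFor(int n, int blockSize)
    {
        return (n + blockSize - 1) / blockSize;
    }

    public static void Main()
    {
        const int N = 10;
        int[] a = new int[N], b = new int[N], c = new int[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

        Launch(BlocksFor(N, 4), 4, a, b, c);   // 3 blocks of 4 threads cover 10 elements
        Console.WriteLine(string.Join(", ", c)); // 0, 3, 6, 9, 12, 15, 18, 21, 24, 27
    }
}
```

Rounding the block count up means the last block may contain threads with no element to process, which is exactly why the bounds check inside the kernel matters.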

[Figure: a grid containing a 2D array of blocks, where each block contains a 2D array of threads.]

Another interesting point of note is the FreeAll method on the GPU instance. Memory on a GPU is typically more limited than that of the host, so use it wisely. You need to free memory explicitly; however, if the GPU instance goes out of scope, then its destructor will clear up GPU memory.

The final example is somewhat more complex and illustrates the use of structures and multi-dimensional arrays. In file Struct.cs, ComplexFloat is defined:

    [Cudafy]
    public struct ComplexFloat
    {
        public ComplexFloat(float r, float i)
        {
            Real = r;
            Imag = i;
        }

        public float Real;
        public float Imag;

        public ComplexFloat Add(ComplexFloat c)
        {
            return new ComplexFloat(Real + c.Real, Imag + c.Imag);
        }
    }

The complete structure will be translated. It is not necessary to put attributes on the members. We can freely make use of this structure in both the host and GPU code. In this case, we initialize a 3D array of these on the host and then transfer it to the GPU. On the GPU, we do the following:

    [Cudafy]
    public static void struct3D(GThread thread, ComplexFloat[,,] result)
    {
        int x = thread.blockIdx.x;
        int y = thread.blockIdx.y;
        int z = 0;
        while (z < result.GetLength(2))
        {
            result[x, y, z] = result[x, y, z].Add(result[x, y, z]);
            z++;
        }
    }

The threads are launched this time [...] and y component (we launch a grid of threads equal to the size of the x and y [...]
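When checking the output of struct3D, a host-side reference implementation is handy. The sketch below is plain C# that runs on the CPU only: the two outer loops stand in for the 2D grid of blocks, and the inner loop is the same z loop as in the kernel. The struct is repeated without the Cudafy attribute so the example is self-contained, and the class name Struct3DReference is made up for this illustration.

```csharp
using System;

// Same shape as the ComplexFloat struct in the article, minus the Cudafy attribute.
public struct ComplexFloat
{
    public ComplexFloat(float r, float i)
    {
        Real = r;
        Imag = i;
    }

    public float Real;
    public float Imag;

    public ComplexFloat Add(ComplexFloat c)
    {
        return new ComplexFloat(Real + c.Real, Imag + c.Imag);
    }
}

// Hypothetical CPU reference for validating GPU results.
public static class Struct3DReference
{
    // Adds every element to itself, which is what the struct3D kernel does.
    public static void Run(ComplexFloat[,,] result)
    {
        for (int x = 0; x < result.GetLength(0); x++)         // one block per x ...
            for (int y = 0; y < result.GetLength(1); y++)     // ... and per y
                for (int z = 0; z < result.GetLength(2); z++) // each thread walks z
                    result[x, y, z] = result[x, y, z].Add(result[x, y, z]);
    }

    public static void Main()
    {
        ComplexFloat[,,] data = new ComplexFloat[2, 3, 4];
        for (int x = 0; x < 2; x++)
            for (int y = 0; y < 3; y++)
                for (int z = 0; z < 4; z++)
                    data[x, y, z] = new ComplexFloat(x + y + z, 1f);

        Run(data);
        Console.WriteLine("{0} + {1}i", data[1, 2, 3].Real, data[1, 2, 3].Imag); // 12 + 2i
    }
}
```

Comparing this reference against the array copied back from the device, element by element, gives a quick correctness check for the GPU version.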
