GPU Computing with CUDA

NVIDIA's CUDA is one of the most widely used GPU computing platforms. Numerics.NET lets you take advantage of CUDA-enabled graphics cards and devices through its distributed computing framework. CUDA support is enabled through the CudaProvider class.

This section only discusses issues specific to CUDA. For general information on the distributed computing framework, see the previous section on distributed and GPU computing.

Prerequisites

Version 4.0 or higher of the .NET Framework is required to use the CUDA functionality. You also need NVIDIA CUDA Toolkit v5.5 (for 32-bit) or v7.5 (for 64-bit) installed on your machine. The toolkit can be downloaded from NVIDIA's website.

To run the software, you need a CUDA-enabled graphics card with compute capability 1.3 or higher.

Creating CUDA-enabled applications

The first step in adding CUDA support to your application is to add a reference to the CUDA provider package for your platform, Numerics.NET.Native.Cuda.

Next, you need to inform the distributed computing framework that you are using the CUDA provider:

C#
DistributedProvider.Current =
    Numerics.NET.Distributed.CudaProvider.Default;

Finally, you need to adapt your code to use distributed arrays where appropriate. The guidelines for working with distributed arrays from the previous section apply to CUDA code as well.
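As a minimal sketch of what adapted code might look like: the MakeDistributed() helper below is hypothetical and stands in for whatever mechanism the previous section describes for creating distributed arrays, and the use of operator overloads on distributed vectors is an assumption.

C#
using Numerics.NET;
using Numerics.NET.Distributed;

// Select the CUDA provider before performing distributed operations.
DistributedProvider.Current = CudaProvider.Default;

// Create some data on the host.
var x = Vector.Create(1.0, 2.0, 3.0, 4.0);
var y = Vector.Create(4.0, 3.0, 2.0, 1.0);

// MakeDistributed() is a hypothetical helper; use the mechanism
// from the previous section to create distributed arrays.
var dx = x.MakeDistributed();
var dy = y.MakeDistributed();

// With the CUDA provider active, this operation runs on the GPU.
var sum = dx + dy;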

CUDA-specific functionality

The CUDA provider exposes a number of functions specific to the CUDA environment:

GetAvailableMemory()
    Returns the free memory available on the device, in bytes. Note that, because of memory fragmentation, it is unlikely that a single block of this size can be allocated.

GetTotalMemory()
    Returns the total memory on the device, in bytes.

GetDeviceLimit(Int32)
    Wrapper for the cudaDeviceGetLimit function.

The GetAvailableMemory() method is particularly useful for verifying that all device memory has been properly released.
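For instance, the free memory can be compared before and after a computation to detect device memory that was not released. This is a sketch that assumes the two methods return the byte counts as long values:

C#
var cuda = Numerics.NET.Distributed.CudaProvider.Default;

long total = cuda.GetTotalMemory();
long freeBefore = cuda.GetAvailableMemory();

// ... perform GPU computations and release all distributed arrays ...

long freeAfter = cuda.GetAvailableMemory();
Console.WriteLine("Device memory: {0} of {1} bytes free.", freeAfter, total);
if (freeAfter < freeBefore)
    Console.WriteLine("Warning: some device memory was not released.");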

Interoperating with other CUDA libraries

The CUDA provider supplies a large number of functions that are optimized for CUDA GPUs. Sometimes, however, it is necessary to call into external libraries. This section outlines how to do this.

A pointer to device memory can be obtained from a distributed array through the NativeStorage property, which is available for both vectors and matrices. This property returns a storage structure with the relevant fields described below.

For vectors, the Values property is an IntPtr that points to the start of the memory block that contains the data for the vector. The Offset property is the number of elements (not bytes) from the start of the memory block to the vector's first element. This information can be combined to get the starting address of the vector's elements.

Storage for vectors may not be contiguous. This can happen, for example, when the vector represents a row in a matrix. The Stride property specifies the number of elements between consecutive vector elements. A value of 1 corresponds to contiguous storage.
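Putting this together, the device address of element i of a vector can be computed as follows. This is a sketch: the property names follow the description above, and double-precision elements are assumed.

C#
// storage describes the device memory backing a distributed vector.
var storage = vector.NativeStorage;

// Element i is located (Offset + i * Stride) elements past the
// start of the memory block; each double occupies 8 bytes.
IntPtr elementAddress = IntPtr.Add(storage.Values,
    (storage.Offset + i * storage.Stride) * sizeof(double));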

For matrices, the Values property is likewise an IntPtr that points to the start of the memory block that contains the data for the matrix. The Offset property is the number of elements (not bytes) from the start of the memory block to the matrix's first element. This information can again be combined to get the starting address of the matrix's elements.

Matrices are stored in column-major order. This means that columns are stored contiguously, but not all elements in a matrix need be contiguous. The LeadingDimension property specifies the number of elements between the start of each column. This is usually equal to the number of rows in the matrix, but it may be larger, for example when the matrix is a view of part of a larger matrix.
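Similarly, the device address of element (i, j) of a matrix follows from the column-major layout. Again a sketch, assuming double-precision elements and the property names described above:

C#
var storage = matrix.NativeStorage;

// In column-major order, element (i, j) is located
// (Offset + i + j * LeadingDimension) elements past the block start.
IntPtr elementAddress = IntPtr.Add(storage.Values,
    (storage.Offset + i + j * storage.LeadingDimension) * sizeof(double));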

Once the device addresses of the data have been obtained, they can be passed to an external function. If the function modifies the values of an array, this should be signaled by invalidating the array's local data with a call to Invalidate(DistributedDataLocation); otherwise, an outdated local copy of the data may be used when retrieving the results.
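For example, the sketch below passes a vector's device address to an external routine that scales the data in place, then invalidates the local copy. ScaleOnDevice and its library are hypothetical, and DistributedDataLocation.Local is an assumed enumeration value based on the description above.

C#
using System;
using System.Runtime.InteropServices;

// Hypothetical external CUDA routine that scales n doubles in place.
[DllImport("MyCudaKernels")]
static extern void ScaleOnDevice(IntPtr values, int n, double factor);

// dx is a distributed vector; compute its device address as before.
var storage = dx.NativeStorage;
IntPtr address = IntPtr.Add(storage.Values,
    storage.Offset * sizeof(double));

ScaleOnDevice(address, dx.Length, 2.0);

// The external call modified device memory; discard the local copy
// so the next read fetches the updated data from the device.
// (DistributedDataLocation.Local is an assumption.)
dx.Invalidate(DistributedDataLocation.Local);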

The CUDA provider has an overloaded Copy() method that can copy from device to host, host to device, and device to device.