Compiling Rust for GPUs

A couple of days back, I tweeted that I had just run code written in Rust on the GPU. It's about time I provided some more details. This is a project I worked on with Milinda Pathirage, a fellow student at IU. I should emphasize that this is very much at the proof-of-concept stage. I doubt it will work well enough to do anything useful, but it does work well enough to do something, and it would certainly be possible to extend it. That said, I will include links to our code so the valiant hackers out there can try it out if they wish. For posterity's sake, here is, to my knowledge, the first fragment of Rust code to ever execute on a GPU:

#[kernel]
fn add_float(x: &float, y: &float, z: &mut float) {
    *z = *x + *y;
}

There are two main parts to this project. The first is compiling Rust code into something suitable for running on the GPU. We do this using the PTX backend that is part of LLVM. The second part is loading and executing the kernel. For this, we use OpenCL and its clCreateProgramWithBinary API. In this post, I'll focus on the issues encountered with generating PTX code.

The bulk of the work to generate PTX code was already done by the NVPTX backend that was recently contributed to LLVM by NVIDIA. We started out with a very manual process. First we used the --emit-llvm flag for rustc to save the generated LLVM bitcode. From there, we attempted to compile it to PTX using llc:

llc -march=nvptx -mcpu=sm_13 trivial-kernel.ll -o trivial-kernel.ptx

I wasn't terribly surprised to see this fail with one of LLVM's typically opaque error messages. You can see it here if you wish. Basically, Rust was generating code that the NVPTX backend didn't know how to handle. This makes sense; I expect NVIDIA primarily tests the backend on code generated by CUDA, which looks different from code that Rust generates. The next step was to pare down the generated LLVM to something a little more manageable:

target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-apple-darwin"

%"~enum intrinsic::TyDesc[#0]" = type { [16 x i8] }
%tydesc = type { i64, i64, void (i1*, i1*, %tydesc**, i8*)*, void (i1*, i1*, %tydesc**, i8*)*, void (i1*, i1*, %tydesc**, i8*)*, void (i1*, i1*, %tydesc**, i8*)*, i8*, i8* }

define void @_ZN9add_float16_cb9e1b436595b333_00E(i1*, { i64, %tydesc*, i8*, i8*, i8 } addrspace(1)*, double*, double*, double*) uwtable {
static_allocas:
  %5 = alloca double*
  %6 = alloca double*
  %7 = alloca double*
  br label %8

return:                                           ; preds = %8
  ret void

; <label>:8                                       ; preds = %static_allocas
  store double* %2, double** %5
  store double* %3, double** %6
  store double* %4, double** %7
  call void asm "# *z = *x + *y; (trivial-kernel.rs:3:4: 3:16)", ""()
  %9 = load double** %5
  %10 = load double** %6
  %11 = load double* %9
  %12 = load double* %10
  %13 = fadd double %11, %12
  %14 = load double** %7
  store double %13, double* %14
  br label %return
}

Of course, LLVM still fails:

Assertion failed: (!isLiteral() && "Literal structs never have names"), function getName, file /usr/local/src/llvm/lib/VMCore/Type.cpp, line 605.

It seems that NVPTX was having trouble with the anonymous struct in the function arguments ({ i64, %tydesc*, i8*, i8*, i8 }). To test this theory, I replaced that type with an i8*. The argument was ignored anyway, so this shouldn't cause problems. With this change, we ended up with a PTX file.
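For anyone who wants to repeat that experiment without editing the file by hand, here is a minimal sketch of the substitution as a text rewrite. The function name is made up for illustration, and it assumes the pointer to the literal struct is spelled exactly as it is in the listing above; it is not part of rustc or LLVM.

```rust
/// The literal-struct pointer type exactly as it appears in the listing above.
const ENV_PARAM: &str = "{ i64, %tydesc*, i8*, i8*, i8 } addrspace(1)*";

/// Hypothetical helper: swap the literal-struct pointer for a plain `i8*`
/// everywhere it occurs in a chunk of textual LLVM IR.
fn replace_env_param(ir: &str) -> String {
    ir.replace(ENV_PARAM, "i8*")
}
```

Running this over the define line turns the second parameter into i8* and leaves everything else alone.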

At this point, we could either hack the Rust compiler to avoid generating code that the NVPTX backend couldn't handle, or we could improve the NVPTX backend. I opted for the latter, and ended up submitting my first ever patch to LLVM.

After another minor fix or two, it became clear that we were going to have to modify the way Rust generates code as well. For example, the PTX code I linked to above does not include a .entry line, which is required to indicate where a kernel function begins. One option would be to add a new PTX target for Rust, and basically set it up as a cross compiler. This isn't quite what we want. We don't want to run all of Rust on the GPU, just a few portions of a program. Other than the code generator, we want the PTX code to agree with the architectural details of the host system. Instead, I added a -Zptx flag to rustc and started making minor changes to the translation pass. Functions that have the #[kernel] attribute get compiled to use the ptx_kernel calling convention, which tells NVPTX to add the .entry line. According to Patrick, we should probably use a new ABI setting instead, as arbitrary attributes aren't part of the function's type.
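To make the effect of the calling convention concrete, here is a rough sketch that does the equivalent rewrite on textual LLVM IR. This is not how the -Zptx change works (that sets the calling convention inside rustc's translation pass); it is a text-level stand-in, and the function name and the idea of passing in the kernel symbol names are assumptions made for the example.

```rust
/// Hypothetical helper: insert the `ptx_kernel` calling convention on the
/// `define` line of every function whose symbol is listed in `kernel_symbols`.
fn mark_kernels(ir: &str, kernel_symbols: &[&str]) -> String {
    let mut out = String::new();
    for line in ir.lines() {
        // A definition is a kernel if it names one of the requested symbols.
        let is_kernel = line.starts_with("define ")
            && kernel_symbols
                .iter()
                .any(|sym| line.contains(&format!("@{}(", sym)));
        if is_kernel {
            // The calling convention sits between `define` and the return type.
            out.push_str(&line.replacen("define ", "define ptx_kernel ", 1));
        } else {
            out.push_str(line);
        }
        out.push('\n');
    }
    out
}
```

Given a definition that starts with "define void @k(", calling mark_kernels(ir, &["k"]) yields "define ptx_kernel void @k(", while other functions are left as ordinary device functions.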

At any rate, we could now pretty reliably go from Rust to PTX without any manual intervention. The next challenge was to execute the kernel. When we first tried to load the PTX file, OpenCL complained about an "invalid binary." We had previously been able to load a PTX file generated with OpenCL and extracted using clGetProgramInfo, so we decided to compare the Rust-generated code with the OpenCL-generated code. It turns out that the parameters to the kernel were not being annotated with an address space. We manually added .global to the parameters in the Rust-generated code, and we were able to load and execute the kernel. Furthermore, we could manually annotate the LLVM code with addrspace(1) to get the same behavior.

For some types, Rust would have the addrspace(1) annotation, but for others it wouldn't. It turns out Rust was already using address spaces for something related to garbage collection. Unfortunately, Rust and NVPTX disagree on what these address spaces mean. To work around this, I had Rust generate different address spaces when the -Zptx flag is given. At the moment these changes only take effect for & pointers. Others, such as @ pointers, will be more difficult to get working.
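As an illustration of the annotation itself, the sketch below rewrites a single LLVM parameter type so that a plain pointer becomes a pointer into addrspace(1), the address space that corresponded to .global in our experiments. The function name is hypothetical, and it works on type strings only; the real change lives in how rustc chooses address spaces for & pointers.

```rust
/// Hypothetical helper: turn `T*` into `T addrspace(1)*`.
/// Non-pointer types and pointers that already carry an address space
/// are returned unchanged.
fn to_global(param_ty: &str) -> String {
    match param_ty.strip_suffix('*') {
        Some(pointee) if !pointee.contains("addrspace") => {
            format!("{} addrspace(1)*", pointee.trim_end())
        }
        _ => param_ty.to_string(),
    }
}
```

With this, to_global("double*") gives "double addrspace(1)*", which is the form the kernel parameters needed before OpenCL would accept the binary.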

The final missing piece on the code generation side of things is to have threads be able to do different things. This means providing equivalents of the blockIdx, blockDim and threadIdx variables. These show up in LLVM as intrinsic functions, so all we need to do is expose those as new Rust intrinsics. We expect to have this part working soon.
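To show what kernels could look like once those indices are available, here is a CPU-only emulation in present-day Rust. Everything in it is an assumption made for illustration: the ThreadCtx struct stands in for the future intrinsics, launch is a sequential stand-in for a real kernel launch, and the kernel takes slices rather than the single-element references used in the example at the top of this post.

```rust
/// Stand-in for the values the blockIdx, blockDim and threadIdx
/// intrinsics would provide (one dimension only).
#[derive(Clone, Copy)]
struct ThreadCtx {
    block_idx: usize,
    block_dim: usize,
    thread_idx: usize,
}

impl ThreadCtx {
    /// The usual flat index: which element this thread is responsible for.
    fn global_id(&self) -> usize {
        self.block_idx * self.block_dim + self.thread_idx
    }
}

/// Element-wise add; each emulated thread handles at most one element.
fn add_float(ctx: ThreadCtx, x: &[f64], y: &[f64], z: &mut [f64]) {
    let i = ctx.global_id();
    // Threads past the end of the data simply do nothing.
    if i < x.len() && i < y.len() && i < z.len() {
        z[i] = x[i] + y[i];
    }
}

/// Sequential emulation of launching `grid_dim` blocks of `block_dim` threads.
fn launch(grid_dim: usize, block_dim: usize, x: &[f64], y: &[f64], z: &mut [f64]) {
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            let ctx = ThreadCtx { block_idx, block_dim, thread_idx };
            add_float(ctx, x, y, z);
        }
    }
}
```

Launching two blocks of four threads over five-element inputs runs eight emulated threads, five of which write a result and three of which fall through the guard.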

Our work here shows it's possible to compile Rust to run on the GPU. We support an extremely limited subset of Rust at the moment. Most of the remaining challenges have to do with the way data is arranged in memory and how Rust provides safety at runtime. Rust uses a lot of pointer structures, and moving these between host and device memory can be difficult. Perhaps the best thing to do for now is simply to be careful about what data types we use in GPU code. Even if we use relatively flat types, however, we will still need to handle a few more things. For example, Rust does array bounds checks at runtime. If we want to allow arbitrary array indexing safely in GPU code, we'll need a way to do bounds checks and report failures from kernel code. There are clearly a lot of design issues left, but the initial results for compiling Rust to run on the GPU seem very promising.
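One possible shape for reporting bounds failures, sketched on the CPU in present-day Rust: instead of unwinding, an out-of-range access records the offending thread in a shared fault word that the host inspects after the kernel finishes. This is only one design among several, nothing like it exists in our branch yet, and all names here (NO_ERROR, checked_load, the fault word) are invented for the example.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Sentinel meaning "no thread has faulted yet".
const NO_ERROR: u32 = u32::MAX;

/// Bounds-checked load that never panics. On an out-of-range index it
/// records `thread_id` in `fault` (first fault wins) and returns 0.0.
fn checked_load(data: &[f64], index: usize, thread_id: u32, fault: &AtomicU32) -> f64 {
    match data.get(index) {
        Some(value) => *value,
        None => {
            // Only the first faulting thread is recorded; later faults
            // fail the compare-exchange and leave the word untouched.
            let _ = fault.compare_exchange(
                NO_ERROR,
                thread_id,
                Ordering::Relaxed,
                Ordering::Relaxed,
            );
            0.0
        }
    }
}
```

The host would initialize the fault word to NO_ERROR before the launch and check it afterwards; any other value names a thread that indexed out of bounds.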

If you want to try it out, here are links to the code.

https://github.com/eholk/rust/tree/nvptx
https://github.com/eholk/llvm/tree/nvptx-rust