Compiling Rust for GPUs
A couple of days back, I tweeted that I had just run code written in Rust on the GPU. It's about time I provided some more details. This is a project I worked on with Milinda Pathirage, a fellow student at IU. I should emphasize that this is very much in the proof of concept stage. I doubt it will work well enough to do anything useful, but it does work well enough to do something and it would certainly be possible to extend this. That said, I will include links to our code so the valiant hackers out there can try it out if they wish. For posterity's sake, here is, to my knowledge, the first fragment of Rust code to ever execute on a GPU:
#[kernel]
fn add_float(x: &float, y: &float, z: &mut float) {
    *z = *x + *y;
}
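For readers on a current toolchain, here is a minimal sketch of the same function body as ordinary CPU code. It assumes that float corresponds to f64, which the double types in the generated LLVM code suggest. The #[kernel] attribute is left out because it only exists in the patched compiler.

```rust
// Present-day rendering of the kernel body, runnable on the CPU.
// Assumption: the old `float` type maps to `f64`, as the `double`
// in the generated LLVM code suggests.
// The `#[kernel]` attribute is omitted; it only exists in the patched rustc.
pub fn add_float(x: &f64, y: &f64, z: &mut f64) {
    *z = *x + *y;
}
```

Calling add_float(&1.5, &2.25, &mut z) stores the sum in z, which is all a single GPU thread would do here.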
There are two main parts to this project. The first is compiling Rust
code into something suitable for running on the GPU. We do this using
the PTX backend that is part of LLVM. The second part is loading and
executing the kernel. For this, we use OpenCL and its
clCreateProgramWithBinary API. In this post, I'll focus on the issues
encountered with generating PTX code.
The bulk of the work to generate PTX code was already done by the
NVPTX backend that was recently contributed to
LLVM by NVIDIA. We started out with a very manual process. First we
used the --emit-llvm flag for rustc to save the generated LLVM
bitcode. From there, we attempted to compile it as PTX using llc:
llc -march=nvptx -mcpu=sm_13 trivial-kernel.ll -o trivial-kernel.ptx
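If you would rather script this step than type it by hand, something like the following sketch would do. The helper names llc_args and run_llc are made up for illustration; the flags are exactly the ones from the command above, and llc is assumed to be on the PATH.

```rust
use std::process::Command;

/// Builds the argument list for the `llc` command shown above.
/// `llc_args` is a hypothetical helper, not part of any tool.
pub fn llc_args(input: &str, output: &str) -> Vec<String> {
    vec![
        "-march=nvptx".to_string(),
        "-mcpu=sm_13".to_string(),
        input.to_string(),
        "-o".to_string(),
        output.to_string(),
    ]
}

/// Runs `llc` (assumed to be on the PATH) and reports whether it
/// exited successfully. Spawning errors are returned to the caller.
pub fn run_llc(input: &str, output: &str) -> std::io::Result<bool> {
    let status = Command::new("llc").args(llc_args(input, output)).status()?;
    Ok(status.success())
}
```

Keeping the argument list separate from the process launch makes it easy to inspect the command without having LLVM installed.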
I wasn't terribly surprised to see this fail with one of LLVM's typically opaque error messages. You can see it here if you wish. Basically, Rust was generating code that the NVPTX backend didn't know how to handle. This makes sense; I expect NVIDIA primarily tests the backend on code generated by CUDA, which looks different from code that Rust generates. The next step was to pare down the generated LLVM to something a little more manageable:
target datalayout = "e-p:64:64:64-i1:8:8-i8:8:8-i16:16:16-i32:32:32-i64:64:64-f32:32:32-f64:64:64-v64:64:64-v128:128:128-a0:0:64-s0:64:64-f80:128:128-n8:16:32:64"
target triple = "x86_64-apple-darwin"
%"~enum intrinsic::TyDesc[#0]" = type { [16 x i8] }
%tydesc = type { i64, i64, void (i1*, i1*, %tydesc**, i8*)*, void (i1*, i1*, %tydesc**, i8*)*, void (i1*, i1*, %tydesc**, i8*)*, void (i1*, i1*, %tydesc**, i8*)*, i8*, i8* }
define void @_ZN9add_float16_cb9e1b436595b333_00E(i1*, { i64, %tydesc*, i8*, i8*, i8 } addrspace(1)*, double*, double*, double*) uwtable {
static_allocas:
  %5 = alloca double*
  %6 = alloca double*
  %7 = alloca double*
  br label %8
return: ; preds = %8
  ret void
; <label>:8 ; preds = %static_allocas
  store double* %2, double** %5
  store double* %3, double** %6
  store double* %4, double** %7
  call void asm "# *z = *x + *y; (trivial-kernel.rs:3:4: 3:16)", ""()
  %9 = load double** %5
  %10 = load double** %6
  %11 = load double* %9
  %12 = load double* %10
  %13 = fadd double %11, %12
  %14 = load double** %7
  store double %13, double* %14
  br label %return
}
Of course, LLVM still fails:
Assertion failed: (!isLiteral() && "Literal structs never have names"), function getName, file /usr/local/src/llvm/lib/VMCore/Type.cpp, line 605.
It seems that NVPTX was having trouble with the anonymous struct in
the function arguments ({ i64, %tydesc*, i8*, i8*, i8 }). To test
this theory, I replaced that type with an i8 *. The argument was
ignored anyway, so this shouldn't cause problems. With this change, we
ended up with a PTX file.
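I made that edit by hand, but the substitution is mechanical enough to sketch in code. The helper replace_literal_struct below is hypothetical, and it assumes the whole pointer type, including the addrspace(1) qualifier, is swapped for a plain i8*. It only touches define lines, which is enough for the pared-down file above.

```rust
/// The literal struct pointer type from the function signature above.
pub const LITERAL_STRUCT_PTR: &str = "{ i64, %tydesc*, i8*, i8*, i8 } addrspace(1)*";

/// Replaces the literal struct pointer with `i8*` in `define` lines.
/// Hypothetical helper; assumes the argument is never used in the body,
/// as was the case for the kernel in this post.
pub fn replace_literal_struct(ir: &str) -> String {
    ir.lines()
        .map(|line| {
            if line.trim_start().starts_with("define ") {
                line.replace(LITERAL_STRUCT_PTR, "i8*")
            } else {
                line.to_string()
            }
        })
        .collect::<Vec<_>>()
        .join("\n")
}
```

This is a plain text substitution rather than a real LLVM pass, so it is only suitable for small experiments like this one.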
At this point, we could either hack the Rust compiler to avoid generating code that the NVPTX backend couldn't handle, or we could improve the NVPTX backend. I opted for the latter, and ended up submitting my first ever patch to LLVM.
After another minor fix or two, it became clear that we were going to
have to modify the way Rust generates code as well. For example, the
PTX code I linked to above does not include a .entry line, which is
required to indicate where a kernel function begins. One option would
be to add a new PTX target for Rust, and basically set it up as a
cross compiler. This isn't quite what we want. We don't want to run
all of Rust on the GPU, just a few portions of a program. Other than
the code generator, we want the PTX code to agree with the
architectural details of the host system. Instead, I added a -Zptx
flag to rustc and started making minor changes to the translation
pass. Functions that have the #[kernel] attribute get compiled to
use the ptx_kernel calling convention, which tells NVPTX to add the
.entry line. According to Patrick, we should probably use a new
ABI setting instead, as arbitrary attributes aren't part of the
function's type.
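A quick way to confirm that the calling convention took effect is to look for .entry directives in the generated PTX. The sketch below is a rough text scan, not a PTX parser, and kernel_entries is a name I made up for it.

```rust
/// Returns the names of functions declared with a `.entry` directive.
/// Rough line-based scan (hypothetical helper): it skips `//` comment
/// lines and takes the token following `.entry`, up to any `(`.
pub fn kernel_entries(ptx: &str) -> Vec<String> {
    let mut names = Vec::new();
    for line in ptx.lines() {
        let line = line.trim();
        if line.starts_with("//") {
            continue;
        }
        let mut tokens = line.split_whitespace();
        while let Some(token) = tokens.next() {
            if token == ".entry" {
                if let Some(next) = tokens.next() {
                    let name = next.split('(').next().unwrap_or("");
                    if !name.is_empty() {
                        names.push(name.to_string());
                    }
                }
                break;
            }
        }
    }
    names
}
```

An empty result for a file that should contain a kernel means the function was emitted as an ordinary device function.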
At any rate, we could now pretty reliably go from Rust to PTX without
any manual intervention. The next challenge was to execute the
kernel. When we first tried to load the PTX file, OpenCL complained
about an "invalid binary." We had previously been able to load a PTX
file generated with OpenCL and extracted using clGetProgramInfo, so we
decided to compare the Rust-generated code with the OpenCL-generated
code. It turns out that the parameters to the kernel were not being
annotated with an address space. We manually added .global to the
parameters in the Rust-generated code, and we were able to load and
execute the kernel. Furthermore, we could manually annotate the LLVM
code with addrspace(1) to get the same behavior.
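The comparison we did by eye can be approximated with a small check that lists kernel parameters lacking a .global annotation. This is only a sketch: params_missing_global is a hypothetical helper, it assumes one .param declaration per line, and it assumes every parameter is a pointer (true for add_float, but scalar parameters would be flagged too).

```rust
/// Lists `.param` declarations that carry no `.global` annotation.
/// Hypothetical helper. Assumes one declaration per line and that
/// every kernel parameter is a pointer into global memory.
pub fn params_missing_global(ptx: &str) -> Vec<String> {
    ptx.lines()
        .map(|line| line.trim().trim_end_matches(','))
        .filter(|line| {
            let has = |wanted: &str| line.split_whitespace().any(|t| t == wanted);
            has(".param") && !has(".global")
        })
        .map(|line| line.to_string())
        .collect()
}
```

Any line this returns is a candidate for the manual .global fix described above.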
For some types, Rust would have the addrspace(1) annotation, but for
others it wouldn't. It turns out Rust was already using address spaces
for something related to garbage collection. Unfortunately, Rust and
NVPTX disagree on what these address spaces mean. To work around this,
I had Rust generate different address spaces when the -Zptx flag is
given. At the moment these changes only take effect for & pointers.
Others, such as @ pointers, will be more difficult to get working.
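To make the conflict concrete, here is a toy model of the decision, not the actual code in rustc. The names PointerKind and address_space are invented for this sketch. It assumes that without -Zptx, & pointers use the default address space 0 and @ pointers use address space 1, as the generated LLVM code earlier in this post suggests, and that with -Zptx, & pointers move to address space 1 so that NVPTX treats them as global memory.

```rust
/// The two pointer kinds discussed above (illustrative model only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PointerKind {
    /// `&` pointers.
    Borrowed,
    /// `@` pointers, which are garbage collected.
    Managed,
}

/// Picks an LLVM address space for a pointer.
/// Assumptions: without `-Zptx`, `&` is in address space 0 and `@` in 1;
/// with `-Zptx`, `&` is in address space 1 (global memory for NVPTX).
/// `@` pointers under `-Zptx` are unsupported here, so `None` is returned.
pub fn address_space(kind: PointerKind, ptx: bool) -> Option<u32> {
    match (kind, ptx) {
        (PointerKind::Borrowed, false) => Some(0),
        (PointerKind::Managed, false) => Some(1),
        (PointerKind::Borrowed, true) => Some(1),
        (PointerKind::Managed, true) => None,
    }
}
```

The table shows why the two schemes cannot coexist as they are: address space 1 means "garbage collected" on the host side and "global memory" on the GPU side.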
The final missing piece on the code generation side of things is to
have threads be able to do different things. This means providing
equivalents of the blockIdx, blockDim and threadIdx variables. These
show up in LLVM as intrinsic functions, so all we need to do is expose
those as new Rust intrinsics. We expect to have this part working soon.
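To show what those variables buy us, here is a CPU-only simulation of the usual indexing pattern, where each thread computes blockIdx * blockDim + threadIdx and handles one element. Everything here (ThreadContext, launch, vector_add) is a stand-in I made up for illustration. It is one-dimensional and runs the threads sequentially, and it is not the intrinsic interface we plan to expose.

```rust
/// Stand-in for the per-thread values a GPU provides (one dimension only).
pub struct ThreadContext {
    pub block_idx: usize,
    pub block_dim: usize,
    pub thread_idx: usize,
}

impl ThreadContext {
    /// The usual global index: blockIdx * blockDim + threadIdx.
    pub fn global_index(&self) -> usize {
        self.block_idx * self.block_dim + self.thread_idx
    }
}

/// Runs `kernel` once per simulated thread, sequentially on the CPU.
pub fn launch<F: FnMut(&ThreadContext)>(grid_dim: usize, block_dim: usize, mut kernel: F) {
    for block_idx in 0..grid_dim {
        for thread_idx in 0..block_dim {
            kernel(&ThreadContext { block_idx, block_dim, thread_idx });
        }
    }
}

/// Element-wise addition where each simulated thread handles one element.
pub fn vector_add(x: &[f64], y: &[f64], z: &mut [f64], block_dim: usize) {
    assert!(block_dim > 0, "block_dim must be positive");
    let n = x.len().min(y.len()).min(z.len());
    // Round up so that a partially filled last block is still launched.
    let grid_dim = (n + block_dim - 1) / block_dim;
    launch(grid_dim, block_dim, |ctx| {
        let i = ctx.global_index();
        // Threads past the end of the data do nothing.
        if i < n {
            z[i] = x[i] + y[i];
        }
    });
}
```

The guard on i matters because the grid is rounded up to a whole number of blocks, so the last block may contain threads with no element to process.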
Our work here shows it's possible to compile Rust to run on the GPU. We support an extremely limited subset of Rust at the moment. Most of the remaining challenges have to do with the way data is arranged in memory and how Rust provides safety at runtime. Rust uses a lot of pointer structures, and moving these between host and device memory can be difficult. Perhaps the best thing to do for now is simply to be careful about what data types we use in GPU code. Even if we use relatively flat types, however, we will still need to handle a few more things. For example, Rust does array bounds checks at runtime. If we want to allow arbitrary array indexing safely in GPU code, we'll need a way to do bounds checks and report failures from kernel code. There are clearly a lot of design issues left, but the initial results for compiling Rust to run on the GPU seem very promising.
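As one possible direction for the bounds check problem, a kernel could record failures in a status word that the host inspects after the kernel finishes, instead of aborting. The sketch below simulates that idea on the CPU. The names checked_load, STATUS_OK and STATUS_OUT_OF_BOUNDS are hypothetical, and this is not something we have implemented.

```rust
/// Status codes a kernel could write back to the host (hypothetical).
pub const STATUS_OK: u32 = 0;
pub const STATUS_OUT_OF_BOUNDS: u32 = 1;

/// Loads `data[index]` if it is in bounds. Otherwise records the failure
/// in `status` and returns 0.0 so the simulated kernel can keep running.
pub fn checked_load(data: &[f64], index: usize, status: &mut u32) -> f64 {
    match data.get(index) {
        Some(value) => *value,
        None => {
            *status = STATUS_OUT_OF_BOUNDS;
            0.0
        }
    }
}
```

The host would initialize the status word to STATUS_OK before launching and treat any other value afterwards as a failed kernel run.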
If you want to try it out, here are links to the code.
https://github.com/eholk/rust/tree/nvptx
https://github.com/eholk/llvm/tree/nvptx-rust