My last post showed that it’s now possible to call code written in Harlan from C++ programs. Sadly, the performance numbers I posted were pretty embarrassing. On the bright side, when you have a 20-30x slowdown like we saw before, it’s usually pretty easy to get most of that back. In this post, we’ll see how. The performance still isn’t where I’d like it to be, but when we’re done today, we’ll only be seeing about a 4x slowdown relative to CUBLAS.
The first step is to find out where we are spending our time so we know what to try and optimize. I had a hunch that I was pretty sure about, but it’s good to have some data to confirm our suspicions. We start with the code from last time, but add timing statements like this:
 (define (harlan_dot N pa pb)
-  (let* ((a (import-float-vec pa N))
+  (let* ((t0 (time-s))
+         (a (import-float-vec pa N))
+         (t1 (time-s))
          (b (import-float-vec pb N))
+         (t2 (time-s))
          (dot
-          (reduce + (kernel ((a a) (b b)) (* a b)))))
+          (reduce + (kernel ((a a) (b b)) (* a b))))
+         (t3 (time-s)))
+    (print "Import vector a time: ") (println (- t1 t0))
+    (print "Import vector b time: ") (println (- t2 t1))
+    (print "Kernel time: ") (println (- t3 t2))
     dot)))
The function has two phases. First, it has to copy the data in from the host program. Secondly, it launches a kernel to actually perform the dot product computation. To get a very non-scientific idea of how we are spending our time, we just have to run the program a bunch of times. A typical run on my machine generates results like this (note that these results are using OpenCL on the CPU instead of the GPU):
Import vector a time: 0.25
Import vector b time: 0.25
Kernel time: 0.375
harlan_dot 0.856121
We can see that we’re actually spending the majority of our time copying the two input vectors around. This accounts for half a second, while the actual computation time is only 375ms. Incidentally, it’s not terribly uncommon to spend more time shuffling data around than actually computing on that data. This is why techniques like zero copy are so important.
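To make the two phases concrete outside of Harlan, here is a minimal C++ sketch of the same measurement: copy both inputs, then compute the dot product, timing each phase separately. Everything in it is an assumption for illustration. The names `import_float_vec`, `dot`, `PhaseTimes`, and `timed_dot` are hypothetical stand-ins, the computation runs sequentially on the CPU rather than in an OpenCL kernel, and the numbers it produces will not match the ones above.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for Harlan's import-float-vec: copy n floats from
// host memory into a freshly allocated vector.
std::vector<float> import_float_vec(const float* p, std::size_t n) {
    return std::vector<float>(p, p + n);
}

// Plain sequential dot product, standing in for the reduce-over-kernel step.
float dot(const std::vector<float>& a, const std::vector<float>& b) {
    assert(a.size() == b.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) sum += a[i] * b[i];
    return sum;
}

struct PhaseTimes {
    double import_s;   // seconds spent copying both inputs
    double compute_s;  // seconds spent computing the dot product
    float result;      // the dot product itself
};

// Run both phases and report how long each one took.
PhaseTimes timed_dot(const float* pa, const float* pb, std::size_t n) {
    using clock = std::chrono::steady_clock;
    auto t0 = clock::now();
    std::vector<float> a = import_float_vec(pa, n);
    std::vector<float> b = import_float_vec(pb, n);
    auto t1 = clock::now();
    float r = dot(a, b);
    auto t2 = clock::now();
    std::chrono::duration<double> import_d = t1 - t0;
    std::chrono::duration<double> compute_d = t2 - t1;
    return {import_d.count(), compute_d.count(), r};
}
```

Comparing `import_s` against `compute_s` for a large input gives the same kind of rough breakdown as the timing statements in `harlan_dot`.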
We’ll try two things to improve the performance: ensuring compiler optimizations are enabled, and making compiler optimizations work better.
The impact of compiler optimizations
There’s an adage that says there’s a difference between optimization and not being stupid. This “optimization” truly falls in the not-being-stupid category. The function import-float-vec relies on the unsafe-deref-float and unsafe-set!-float functions, which I implemented as builtin functions. Thus, most of the time in these functions is actually spent in the runtime. When I checked the Makefile for the runtime, I realized I had never enabled compiler optimizations. My thinking was probably something like “oh, we don’t need optimizations, since all the interesting computation in Harlan happens in OpenCL kernels.” It was easy enough to pass an optimization flag to the compiler and see what happens. After this change, the results look more like this:
Import vector a time: 0.25
Import vector b time: 0.125
Kernel time: 0.375
harlan_dot 0.774522
We’ve shaved off about 82ms, which is actually more than I expected. Also, we see a bit of an artifact in the data, which is that importing vector b takes half the time of vector a, even though they are both the same size. It seems the timer I’m using only has a resolution of about an eighth of a second. This isn’t great, but it’s good enough for our purposes.
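One way to check a suspicion like this is to measure the smallest step a clock ever reports. The sketch below does that in C++ for any clock with the standard `now()` interface. It says nothing about how Harlan’s `time-s` is implemented; the helper names `observed_tick_seconds` and `min_observed_tick_seconds` are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <chrono>

// Spin until the clock reports a new value, then return the size of that
// step in seconds. This approximates the smallest interval the clock can
// resolve, plus whatever scheduling noise happened during the spin.
template <typename Clock>
double observed_tick_seconds() {
    auto start = Clock::now();
    auto next = Clock::now();
    while (next == start) next = Clock::now();
    return std::chrono::duration<double>(next - start).count();
}

// Repeat the measurement and keep the smallest step seen, which filters
// out trials where the thread was preempted mid-spin.
template <typename Clock>
double min_observed_tick_seconds(int trials) {
    assert(trials > 0);
    double best = observed_tick_seconds<Clock>();
    for (int i = 1; i < trials; ++i) {
        double t = observed_tick_seconds<Clock>();
        if (t < best) best = t;
    }
    return best;
}
```

Calling `min_observed_tick_seconds<std::chrono::steady_clock>(5)` typically reports a step far below a millisecond. A timer that only advances in eighth-of-a-second steps would return roughly 0.125 from the same test.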
Coding for better optimization
The bodies of the unsafe-deref-float and unsafe-set!-float functions are one line of code each: a single memory reference. These are cheap and really should just be open coded by the compiler. I didn’t do it this way at first because I would have had to push a new language form through a lot of the compiler and wanted something quick and dirty. This would be easier if more of the compiler used the Nanopass framework. However, these functions are prime candidates for inlining, and compilers are pretty good at inlining. Writing these functions this way shouldn’t be a problem because the compiler will just inline them away. After all, just look at this section of the disassembly from import-float-vec and see how there are no procedure calls:
0x00007ffff42afa08 <+296>: mov %r12d,%esi
0x00007ffff42afa0b <+299>: mov %rbp,%rdi
0x00007ffff42afa0e <+302>: callq 0x7ffff42af270 <_Z18unsafe$deref$floatPfi@plt>
0x00007ffff42afa13 <+307>: mov %r12d,%esi
0x00007ffff42afa16 <+310>: mov %r13,%rdi
0x00007ffff42afa19 <+313>: add $0x1,%r12d
0x00007ffff42afa1d <+317>: callq 0x7ffff42af470 <_Z18unsafe$set$b$floatPfif@plt>
0x00007ffff42afa22 <+322>: cmp %r12d,%ebx
0x00007ffff42afa25 <+325>: jg 0x7ffff42afa08 <_Z16import$float$vecPfiRP7region_+296>
Oh, just kidding! There are calls to unsafe-deref-float and unsafe-set!-float. Since these functions are just a single memory reference, I wouldn’t be surprised if we’re paying ten times the cost of the function in just call overhead.
The problem is that the compiler cannot inline these functions because they are in a separate library. Instead, we want to include these functions in the source code we are compiling. The easiest way to do this is to move their definitions to harlan.hpp, which is included in every Harlan program. After making this change, the performance looks something like this:
Import vector a time: 0.125
Import vector b time: 0
Kernel time: 0.375
harlan_dot 0.522239
Now we’re spending the majority of time in the kernel, which is what we’d like.
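The idea behind the header change can be sketched in a few lines of C++. This is not the actual contents of harlan.hpp. The names `unsafe_deref_float`, `unsafe_set_float`, and `copy_floats` are hypothetical, and the parameter types are only inferred from the mangled symbols in the disassembly above (`Pfi` and `Pfif`, that is, a `float*`, an `int`, and for the setter a `float`).

```cpp
#include <cassert>

// Hypothetical header-style definitions. Because the bodies are visible in
// every translation unit that includes the header, the optimizer is free to
// replace each call with a single load or store.
inline float unsafe_deref_float(const float* p, int i) { return p[i]; }
inline void unsafe_set_float(float* p, int i, float v) { p[i] = v; }

// A copy loop shaped like the one in the disassembly: one deref and one set
// per element. With the definitions above in scope, an optimizing build can
// compile this loop without any procedure calls.
inline void copy_floats(const float* src, float* dst, int n) {
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        unsafe_set_float(dst, i, unsafe_deref_float(src, i));
    }
}
```

When the same two functions live in a separately compiled library, the compiler only sees their declarations, so every loop iteration has to go through the `callq` instructions shown earlier.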
Here’s the chart from last time, along with our new results.
We’ve got a ways to go still, but this is a significant improvement over where we were before. The next step is to dig deeper into why the kernel time is as large as it is.