My last post showed that it's now possible to call code written in Harlan from C++ programs. Sadly, the performance numbers I posted were pretty embarrassing. On the bright side, when you have a 20-30x slowdown like we saw before, it's usually pretty easy to get most of that back. In this post, we'll see how. The performance still isn't where I'd like it to be, but when we're done today, we'll only be seeing about a 4x slowdown relative to CUBLAS.

The first step is to find out where we are spending our time, so we know what to optimize. I had a strong hunch already, but it's good to have some data to confirm our suspicions. We start with the code from last time, but add timing statements like this:

   (define (harlan_dot N pa pb)
-    (let* ((a (import-float-vec pa N))
+    (let* ((t0 (time-s))
+           (a (import-float-vec pa N))
+           (t1 (time-s))
            (b (import-float-vec pb N))
+           (t2 (time-s))
            (dot
-            (reduce + (kernel ((a a) (b b)) (* a b)))))
+            (reduce + (kernel ((a a) (b b)) (* a b))))
+           (t3 (time-s)))
+      (print "Import vector a time: ") (println (- t1 t0))
+      (print "Import vector b time: ") (println (- t2 t1))
+      (print "Kernel time: ") (println (- t3 t2))
       dot)))

The function has two phases. First, it has to copy the data in from the host program. Second, it launches a kernel to actually perform the dot product computation. To get a very non-scientific idea of how we are spending our time, we just have to run the program a bunch of times. A typical run on my machine generates results like this (note that these results are using OpenCL on the CPU instead of the GPU):

Import vector a time: 0.25
Import vector b time: 0.25
Kernel time: 0.375
             harlan_dot 0.856121

We can see that we're actually spending the majority of our time copying the two input vectors around. This accounts for half a second, while the actual computation time is only 375ms. Incidentally, it's not terribly uncommon to spend more time shuffling data around than actually computing on that data. This is why techniques like zero copy are so important.
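To make that concrete, here's a minimal sketch of the zero copy idea in plain OpenCL. This isn't Harlan's actual runtime code, and the function name is made up; it just shows the mechanism:

   #include <CL/cl.h>

   // Hypothetical sketch, not Harlan's runtime. CL_MEM_USE_HOST_PTR asks
   // the driver to work directly out of the caller's allocation instead of
   // copying it into a fresh buffer, which can make the import step
   // (nearly) free, particularly on CPU OpenCL devices where host and
   // device memory are the same.
   cl_mem import_without_copy(cl_context ctx, float *host, int n)
   {
       cl_int err;
       return clCreateBuffer(ctx,
                             CL_MEM_READ_ONLY | CL_MEM_USE_HOST_PTR,
                             n * sizeof(float),
                             host,  // must stay alive while the buffer is in use
                             &err);
   }

The trade-off is that the host allocation now has to outlive the buffer, and the driver may still copy behind the scenes if the pointer doesn't meet its alignment requirements.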

We'll try two things to improve the performance: ensuring compiler optimizations are enabled, and writing the code so those optimizations can do their job.

The impact of compiler optimizations

There's an adage that says there's a difference between optimization and not being stupid. This "optimization" truly falls in the not being stupid category. The function import-float-vec relies on the unsafe-deref-float and unsafe-set!-float functions, which I implemented as builtin functions, so most of the time in these functions is actually spent in the runtime. When I checked the Makefile for the runtime, I realized I had never enabled compiler optimizations. My thinking was probably something like "oh, we don't need optimizations, since all the interesting computation in Harlan happens in OpenCL kernels." It was easy enough to pass the -O2 flag to the compiler and see what happens.
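The change amounts to something like this; treat it as a hypothetical fragment, since the runtime's actual Makefile and variable names may differ:

   # Hypothetical Makefile fragment; the runtime's actual variable names
   # may differ. -O2 enables the compiler's usual optimization passes.
   CXXFLAGS += -O2

After this change, the results look more like this: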

Import vector a time: 0.25
Import vector b time: 0.125
Kernel time: 0.375
             harlan_dot 0.774522

We've shaved off about 82ms, which is actually more than I expected. Also, we see a bit of an artifact in the data, which is that importing vector b takes half the time of vector a, even though they are both the same size. It seems the timer I'm using only has a resolution of about an eighth of a second. This isn't great, but it's good enough for our purposes.
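If we ever need finer-grained measurements, a higher-resolution clock is easy to come by on the C++ side. Here's a minimal sketch using std::chrono; this is not how time-s is currently implemented, just what a replacement could look like:

   #include <chrono>

   // Sketch of a higher-resolution timer, not Harlan's current time-s.
   // steady_clock typically ticks at nanosecond or microsecond
   // granularity, far better than an eighth of a second.
   static double time_s()
   {
       using namespace std::chrono;
       return duration<double>(steady_clock::now().time_since_epoch()).count();
   }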

Coding for better optimization

The bodies of the unsafe-deref-float and unsafe-set!-float functions are each one line of code: a single memory reference. These are cheap operations that really should just be open coded by the compiler. I didn't do it that way at first because it would have meant pushing a new language form through a lot of the compiler, and I wanted something quick and dirty. This would be easier if more of the compiler used the Nanopass framework. However, these functions are prime candidates for inlining, and compilers are pretty good at inlining, so writing them this way shouldn't be a problem; the compiler will just inline them away. After all, just look at this section of the disassembly from import-float-vec and see how there are no procedure calls:

   0x00007ffff42afa08 <+296>:   mov    %r12d,%esi
   0x00007ffff42afa0b <+299>:   mov    %rbp,%rdi
   0x00007ffff42afa0e <+302>:   callq  0x7ffff42af270 <_Z18unsafe$deref$floatPfi@plt>
   0x00007ffff42afa13 <+307>:   mov    %r12d,%esi
   0x00007ffff42afa16 <+310>:   mov    %r13,%rdi
   0x00007ffff42afa19 <+313>:   add    $0x1,%r12d
   0x00007ffff42afa1d <+317>:   callq  0x7ffff42af470 <_Z18unsafe$set$b$floatPfif@plt>
   0x00007ffff42afa22 <+322>:   cmp    %r12d,%ebx
   0x00007ffff42afa25 <+325>:   jg     0x7ffff42afa08 <_Z16import$float$vecPfiRP7region_+296>

Oh, just kidding! There are calls to unsafe-deref-float and unsafe-set!-float. Since these functions each perform just a single memory reference, I wouldn't be surprised if we're paying ten times the cost of the function in call overhead alone.
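To see why that hurts, here's roughly the shape of these two functions, reconstructed from the mangled names in the disassembly above; the verbatim runtime source may differ:

   // Reconstructed shape of the builtins (the runtime's actual definitions
   // may differ slightly). Each body is a single memory reference, so the
   // call/return sequence around it dwarfs the useful work.
   float unsafe_deref_float(float *p, int i)
   {
       return p[i];  // one load
   }

   void unsafe_set_float(float *p, int i, float v)
   {
       p[i] = v;     // one store
   }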

The problem is that the compiler cannot inline these functions because they live in a separate library. Instead, we want to include their definitions in the source code we are compiling. The easiest way to do this is to move them into harlan.hpp, which is included in every Harlan program (a sketch of what this looks like follows the results below). After making this change, the performance looks something like this:

Import vector a time: 0.125
Import vector b time: 0
Kernel time: 0.375
             harlan_dot 0.522239

Now we're spending the majority of time in the kernel, which is what we'd like.
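For the record, the fix looks something like this; treat it as a sketch of the idea rather than the exact contents of harlan.hpp:

   // Sketch of the fix, not the verbatim harlan.hpp. Defining the helpers
   // in the header that every Harlan program includes lets the compiler
   // see their bodies and inline the load and store at each call site.
   static inline float unsafe_deref_float(float *p, int i) { return p[i]; }
   static inline void unsafe_set_float(float *p, int i, float v) { p[i] = v; }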

Here's the chart from last time, along with our new results.

[Chart: Execution time for dot product on 33,554,432 element vectors (shorter bars are better).]

We've got a ways to go still, but this is a significant improvement over where we were before. The next step is to dig deeper into why the kernel time is as large as it is.