• Booting to Rust

    A couple nights ago I was looking over the UEFI spec, and I realized it shouldn’t be too hard to write UEFI applications in Rust. It turns out, you can, and here I will tell you how.

    The thing that surprises me most about UEFI is that it now appears possible to boot your machine without ever writing a single line of assembly language. Booting used to require this tedious process of starting out in 16-bit real mode, then transitioning into 32-bit protected mode and then doing it all over again to get into 64-bit mode. UEFI firmware, on the other hand, will happily load an executable file that you give it and run your code in 64-bit mode from the start. Your startup function receives a pointer to some functions that give you basic console support, as well as an API to access richer features. From a productivity standpoint, this seems like a win, but I also miss the sorcery you used to have to do when you were programming at this level.

    Booting to Rust is a lot like writing bindings to any C library, except that the linking process is a bit more involved.

  • Improving the Performance of Harlan's FFI

    My last post showed that it’s now possible to call code written in Harlan from C++ programs. Sadly, the performance numbers I posted were pretty embarrassing. On the bright side, when you have a 20-30x slowdown like we saw before, it’s usually pretty easy to get most of that back. In this post, we’ll see how. The performance still isn’t where I’d like it to be, but when we’re done today, we’ll only be seeing about a 4x slowdown relative to CUBLAS.

  • Using Harlan in C++ Programs

    So far, Harlan programs have primarily existed in a vacuum. You’d compile them, run them, and they might produce some output. Certainly none of them received any input from the outside world. Most of the test cases use small sets of data, and the larger benchmarks generated incredibly synthetic data, like a vector of 16 million ones. My focus has been on building the compiler itself, so this has been a tolerable situation up to this point. However, Harlan is at the point where it needs more realistic applications and it’s clear the foreign function interface (FFI) story just won’t cut it anymore.

    I’m happy to report that it’s now possible to pass non-trivial amounts of data from C++ to Harlan. Two new features made this possible. First, there are library functions like unsafe-deref-float and unsafe-set!-float which allow reading and writing from raw pointers. Second, there’s a new primitive form called unsafe-vec-ptr which gives a raw pointer to the contents of a vector. These are very low level, but they give us the tools we need to build a reasonably usable FFI. Let’s see how to use these to implement a dot product in Harlan and use it from a C++ program.
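
    To build intuition for what these primitives do, here is a loose Python analogue using ctypes (the buffer, pointer, and variable names here are mine, not Harlan's): reading through a raw pointer plays the role of unsafe-deref-float, and writing through it plays the role of unsafe-set!-float.

```python
import ctypes

# A C-style float array, similar to the raw storage a Harlan vector wraps.
buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)
p = ctypes.cast(buf, ctypes.POINTER(ctypes.c_float))

# Rough analogue of unsafe-deref-float: read through the raw pointer.
third = p[2]

# Rough analogue of unsafe-set!-float: write through the raw pointer.
p[2] = 42.0

print(third, buf[2])  # 3.0 42.0
```

    In Harlan, the pointer itself would come from unsafe-vec-ptr; here we just allocate the buffer ourselves.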

    First, we need to write the dot product function. This is pretty short in Harlan.

      (import ffi)
      (define (harlan_dot N pa pb)
        (let ((a (import-float-vec pa N))
              (b (import-float-vec pb N)))
          (reduce + (kernel ((a a) (b b)) (* a b)))))

    For the most part, this is a straightforward dot product written in Harlan. The main new thing is the call to import-float-vec, which copies a C-style array into a Harlan vector. If you’re curious, its implementation is in ffi.kfc.

    Unlike most Harlan programs, this does not define a main function. Instead, we compile it to a shared library by running the following from your Harlan directory.

    ./harlanc --shared dotprod.kfc

    When this is done, you’ll have a dotprod.so which you can link to your C++ programs. The harlan_dot function is exposed under the signature float harlan_dot(int N, float *pa, float *pb).

    Now, let’s plug this function into the dot product benchmark I wrote about previously. Basically, we add a prototype for harlan_dot, then add a call to TIME(harlan_dot) along with the rest of the benchmarks. You can see the full set of changes here. I commented out the CUBLAS version because I ran into runtime errors that I didn’t feel like debugging. Below is a graph of how Harlan compares with the other implementations.

    Execution time for dot product on 33,554,432 element vectors (shorter bars are better).


    Clearly I’ve got some performance issues to deal with. On the bright side, Harlan runs faster on the GPU than it does on the CPU. I’ll be investigating these performance problems soon.

    As far as the FFI goes, there are some usability issues that remain too. For example, the Harlan compiler and the code it produces have some relative paths hard coded, which means they must run from the Harlan source directory. These shouldn’t be hard to fix. In the meantime, it’s now possible to integrate Harlan code into projects written in other languages.

  • How to write a simple Scheme debugger

    A while back, Rich Loveland asked how one might write a Scheme debugger. I realized that I’ve written many a Scheme interpreter, but I’ve never really thought about how to write a debugger. This post is a first step down that path. We’ll write what I’d consider a minimally functional debugger. It will allow you to break a running program into the debugger (albeit by invoking a primitive function in the program you’re running) and then inspect the variables that are available. As an added bonus, you’ll be able to evaluate arbitrary Scheme expressions from the debugger and even change the values of variables. For the moment, however, we will not support stepping program execution, which is admittedly an important feature of debuggers.

    We’ll start by writing an interpreter for a small but interesting subset of Scheme. Then we’ll see how to add debugging features to the interpreter. As usual, you’ll be able to see all the code on GitHub.

    The Scheme Interpreter

    We’ll start with the so-called “Three Line Interpreter” and then add a few features to make it a bit more interesting. If you’re familiar with Scheme interpreters, feel free to skim this section. If you want more detail, check out Essentials of Programming Languages. Our interpreter, value-of, takes two arguments. The first is the expression to evaluate, and the second is the environment that maps variable names onto values. Here’s the interpreter:

    (define (value-of e env)
      (match e
        (,x (guard (symbol? x))
         (lookup x env))
        ((lambda (,x) ,e)
         (lambda (v) (value-of e (cons (cons x v) env))))
        ((,[e1] ,[e2])
         (e1 e2))))

    It’s a few more than three lines, but we’re mainly interested in the three match clauses. The first line is the variable reference line. If you try to evaluate something like x, we use lookup to go find the current value of x in the environment. I’ll omit the definition of lookup, but it’s basically the same as (cdr (assq x env)).

    The next line is the lambda line. This creates a procedure from a lambda expression. We cheat and use Scheme’s perfectly good built-in lambda. Notice that the procedure we create takes a value, v, and then calls value-of on the body of the lambda. The important bit is the (cons (cons x v) env), which adds a binding of x to the value that was passed in to the environment, so that the interpreter can find the correct value when evaluating the body.

    Finally, we have the application line. Basically, we use match’s catamorphism feature to recursively compute the value of both the thing to be applied and the value to apply it to. Since our lambda line evaluates to Scheme procedures, we just apply the value e1 to its argument, e2.
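
    For readers who don't speak Scheme, here is a rough Python sketch of the same three clauses (the tuple encoding and the names are my own, hypothetical choices, not the post's): a string is a variable reference, ("lambda", x, body) builds a closure, and ("app", e1, e2) applies one value to another.

```python
# The environment is a list of (name, value) pairs searched front to
# back, like (cdr (assq x env)) in the Scheme version.
def lookup(x, env):
    for name, value in env:
        if name == x:
            return value
    raise NameError(x)

def value_of(e, env):
    if isinstance(e, str):                        # variable reference
        return lookup(e, env)
    tag = e[0]
    if tag == "lambda":                           # build a closure
        _, x, body = e
        return lambda v: value_of(body, [(x, v)] + env)
    if tag == "app":                              # evaluate both, then apply
        _, e1, e2 = e
        return value_of(e1, env)(value_of(e2, env))
    raise SyntaxError(repr(e))

# Applying the identity function to itself yields the identity function.
identity = ("lambda", "x", "x")
f = value_of(("app", identity, identity), [])
print(f(42))  # 42
```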

    Although this doesn’t seem like a very rich language, you could use it to compute any computable function if you wanted to. You probably don’t want to though.

    More Features

    Now that we have the core of our interpreter, we can add features to the language. In most cases, this is as simple as adding a few more clauses to our match expression. Let’s start with numbers.

        (,n (guard (number? n)) n)
        ((,op ,[e1] ,[e2]) (guard (memq op '(+ * -)))
         ((eval op) e1 e2))

    The first line leaves numbers as they are. There’s not really much more to do with them.

    The second line lets you do things to numbers. We make sure the operation is one of +, * or -. We could add more, but these are enough for now. We cheat once again and use eval to convert the symbol representing the operator into the Scheme procedure that performs that operation. As before, we use match’s catamorphism feature to evaluate the two arguments to the operator.
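
    Here are those two clauses in isolation as a Python sketch (hypothetical encoding again); a small dispatch table stands in for the eval trick, mapping each operator symbol to the procedure that performs it.

```python
import operator

# Symbol -> procedure, playing the role of (eval op) in the Scheme version.
OPS = {"+": operator.add, "*": operator.mul, "-": operator.sub}

def value_of(e, env):
    if isinstance(e, (int, float)):               # numbers evaluate to themselves
        return e
    if isinstance(e, tuple) and e[0] in OPS:      # (op e1 e2) for +, *, -
        _, e1, e2 = e
        return OPS[e[0]](value_of(e1, env), value_of(e2, env))
    raise SyntaxError(repr(e))

print(value_of(("+", 1, ("*", 2, 3)), []))  # 7
```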

    To make debugging a little more interesting, let’s add some side effects. We add set! with the following match clause.

        ((set! ,x ,[e])
         (update-env! x e env))

    Of course, this isn’t very interesting without knowing what update-env! does. Here you go!

    (define (update-env! x v env)
      (if (eq? x (caar env))
          (set-cdr! (car env) v)
          (update-env! x v (cdr env))))

    Basically, we just search through the environment until we find the variable to change and then use set-cdr! to change its value.
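
    The same idea in a quick Python sketch (names are mine): the environment is a list of two-element [name, value] lists, and we mutate the first matching binding in place, just as set-cdr! does to the pair.

```python
# Python analogue of update-env!: find the binding and mutate it in place.
def update_env(x, v, env):
    for pair in env:
        if pair[0] == x:
            pair[1] = v
            return
    raise NameError(x)

env = [["x", 4], ["y", 5]]
update_env("x", 115, env)
print(env)  # [['x', 115], ['y', 5]]
```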

    Finally, let’s add begin. We could simulate this with lambdas and function applications, but it’s much cleaner just to add it directly. We get begin with this clause:

        ((begin ,e* ... ,e)
         (let loop ((e* e*))
           (if (pair? e*)
               (begin
                 (value-of (car e*) env)
                 (loop (cdr e*)))
               (value-of e env))))

    We start by evaluating each of the first expressions for their effect, and then we return the value of the last expression.
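
    A minimal Python sketch of that rule (the value_of argument here is a hypothetical stand-in for any expression evaluator):

```python
# Evaluate every expression in order for its effect; return the last value.
def eval_begin(exprs, env, value_of):
    result = None
    for e in exprs:
        result = value_of(e, env)
    return result

# With a trivial evaluator that maps every expression to itself:
print(eval_begin([1, 2, 3], [], lambda e, env: e))  # 3
```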

    At this point, we have enough that we can start to write some reasonably interesting programs. In particular, we can write programs that have bugs that we might want to debug. Let’s add a debugger!

    The Debugger

    The first thing we’re going to do is add a way to get into the debugger. Most debuggers do this by loading the source files and letting you click on the point in the code where you want a breakpoint. This takes more UI work than I want to do right now, so instead we’ll just add a special command to our language, (debug), which breaks into the debugger. As usual, this is as simple as adding another match clause:

          ((debug)
           (debugger env))

    This calls out to a function we have yet to define, called debugger. We pass in the environment because the debugger needs to see it in order to inspect variables.

    The debugger itself is just a read-eval-print loop (REPL). It prompts the programmer for a command, then evaluates the command, prints out any results and then continues. Let’s start with a debugger that we can get out of:

    (define (debugger env)
      (printf "debug> ")
      (let ((cmd (read))
            (continue #t))
        (match cmd
          ((continue) (set! continue #f))
          (,else (printf "unknown command\n")))
        (if continue
            (debugger env)
            (printf "running...\n"))))

    We start out by printing debug> and then using read to read in an S-expression. Much like our interpreter (in fact, this is just an interpreter for another language), we use match to determine which command we were given and then evaluate it. For now we support one command, (continue), which continues the execution of the program. We do this by setting a flag that tells the debugger not to continue its loop. Since we were called by value-of, we just return to that function and the interpreter picks up where it left off.
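
    Here is a rough Python rendering of this loop (hypothetical names; read is replaced by a list of commands, and printing by a returned log, so the function is easy to exercise):

```python
# Python sketch of the debugger REPL: process commands until (continue).
def debugger(env, commands):
    log = []
    for cmd in commands:
        if cmd == "continue":
            log.append("running...")
            break            # hand control back to the interpreter
        log.append("unknown command")
    return log

print(debugger([], ["step", "continue"]))  # ['unknown command', 'running...']
```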

    Here’s an example debugging session:

    > (value-of '(debug) '())
    debug> (continue)

    At the Scheme REPL, we call value-of with a simple expression that immediately breaks into the debugger, and we start in the empty environment ('()).

    Most debuggers give you some kind of call stack. The closest analog to that in our language is a list of all the variables in scope, so let’s add a way to see these. Once again, we add one more match clause:

          ((show-env) (show-env env))

    This just forwards to a procedure, show-env, which prints out the values in the environment. Here’s an example of how to use it:

    > (value-of '(((lambda (x)
                     (lambda (y)
                       (begin (debug) (+ x y))))
                   4)
                  5)
                '())
    debug> (show-env)
    0. x: 4
    1. y: 5
    debug> (continue)

    So now we can stop the execution, continue the execution, and see what’s in the environment. What if we want to change things? We could add commands to set values, but a more powerful way is to use the target language itself to do this. Thus, we’ll add an eval command, which evaluates an expression in the debugger:

          ((eval ,e)
           (printf "~s => ~s\n" e (value-of e env)))

    Now, if we want to change values, we can just evaluate a set! expression, like this:

    > (value-of '(((lambda (x)
                     (lambda (y)
                       (begin (debug) (+ x y))))
                   4)
                  5)
                '())
    debug> (show-env)
    0. x: 4
    1. y: 5
    debug> (eval (set! x 115))
    (set! x 115) => #<void>
    debug> (show-env)
    0. x: 115
    1. y: 5
    debug> (continue)

    As expected, the resulting value changes to reflect the fact that we modified the environment during program execution.

    So there’s a first step toward a Scheme debugger. We were able to do this with relatively few changes to our interpreter. Adding more advanced features, however, would require deeper changes. For example, there’s no way to inspect or change the code that’s running now. To do this, we would have to keep track of that code in a form that is accessible to the debugger. Furthermore, we’re still missing a way to do finer-grained execution control, such as stepping over a single statement or out of the current lambda. I suspect most, if not all, of these problems can be solved by writing our interpreter in continuation-passing style. I hope to explore this in a later post.

  • Visualizing the Turing Tarpit

    Jason Hemann and I recently had a paper accepted at FARM called “Visualizing the Turing Tarpit.” The idea grew out of a talk that Jason did at our weekly PL Wonks seminar on the minimalist programming languages, Iota and Jot. At the end of the talk, Ken Shan asked whether this could be used to do some kind of cool fractal visualization of programs. That night, several of us pulled out our computers and started hacking on Iota and Jot interpreters.

    Iota and Jot are examples of Turing Tarpits, that is, languages “in which everything is possible but nothing of interest is easy.” The term comes from Alan Perlis. Turing Tarpits have some utility. The Lambda Calculus is arguably a Turing Tarpit, and yet it is quite useful in the study of programming languages and computability. Iota is notable as a Turing-complete language which consists of only two symbols. For example, i, *i*ii, and **iii are all legal programs in Iota. Sadly, some strings, such as i*i, are not legal. This makes Iota less than ideal for enumerating many programs, as we can’t choose arbitrary strings but must instead be sure we follow the grammar. Jot was designed to fix this. Jot has the property that any binary string is a legal program.

    Using Jot, it’s now incredibly easy to enumerate large numbers of programs. The next step is to plot their behavior. We chose to count how many steps a program executes before terminating (or before we give up, as not all programs will terminate), and then use this count to generate a color for each point. Below is an animated example of what we ended up with.

  • Why Write Compilers in Scheme?

    One of the questions Klint Finley asked me for the Wired article about Harlan was “Why Scheme?” I wasn’t really satisfied with my answer, so I thought I’d answer it more completely here. Besides the fact that we were already experienced writing compilers in Scheme, Scheme has a lot of features that are very useful to compiler writers.

  • Why is Harlan called Harlan?

    One of the more unexpected things to have happened after releasing Harlan was that I was contacted by a couple of people who are named Harlan. One of the common questions about Harlan is actually where the name comes from, so I thought I’d take the time to tell the story here.

    A couple years back we started working on the design of a new language to handle GPUs. One of the initial observations was that people seem generally pretty content with languages like C or even Fortran for specifying computational code, but one of the difficulties in GPU programming is managing the data transfer between host and device memories and also within the memory hierarchy on the GPU. It seemed like this was an area that was ripe for a new language to address. Thinking about it some more, we realized the computational kernels really decide when data needs to be where, so we should let our language schedule data transfers around kernels. Indeed, the language we designed has no way of explicitly doing data transfers, but instead these are driven by the computations being done in kernels.

    It became clear that kernels would play an important part in the language we were building. The word “kernel” happens to sound a bit like “colonel,” and one particularly famous colonel is Colonel Sanders. My advisor mentioned that Colonel Sanders’ first name was Harland, but I misheard this as Harlan, and by the time we figured this out, the name Harlan had already stuck. This is also why all of the test cases have the file extension .kfc.

    It’s fitting, in a way, that we accidentally dropped the last letter from the name, given that a similar thing happened with Scheme, the language in which Harlan is implemented. Scheme was supposed to be called Schemer, but the operating system at the time limited filenames to only six characters, so the “r” was dropped.

  • Announcing the release of Harlan

    I am happy to announce that after about two years of work, I have made the code for Harlan available to the public.


    Harlan is a new programming language for GPU computing. Harlan aims to push the expressiveness of languages available for the GPU further than has been done before. Harlan already has native support for rich data structures, including trees and ragged arrays. Very soon the language will support higher-order procedures.

    To give an example of the look and feel of a Harlan program, below is a simple N-Body Simulation program.

      (extern nanotime () -> u64)
      (define-datatype point3_t
        (point3 float float float))
      (define (make-points N)
        (kernel ((i (iota N)))
          (point3 (int->float i)
                  (int->float i)
                  (int->float i))))
      (define (point-diff x y)
        (match x
          ((point3 a b c)
           (match y
             ((point3 d e f)
              (point3 (- a d) (- b e) (- c f)))))))
      (define (point-add x y)
        (match x
          ((point3 a b c)
           (match y
             ((point3 x y z)
              (point3 (+ a x) (+ b y) (+ c z)))))))
      (define (point-div a y)
        (match a
          ((point3 a b c)
           (point3 (/ a y) (/ b y) (/ c y)))))
      (define (point-mag p)
        (match p
          ((point3 a b c)
           (sqrt (+ (* a a) (+ (* b b) (* c c)))))))
      (define (nbody bodies)
        (kernel ((i bodies))
          (reduce point-add
            (kernel ((j bodies))
              (let* ((diff (point-diff i j))
                     (d (point-mag diff)))
                (if (< 0 d)
                    (point-div diff (* (* d d) d))
                    (point3 0 0 0)))))))
      (define (main)
        (let* ((bodies (make-points 1000))
               (start (nanotime))
               (forces (nbody bodies))
               (stop (nanotime)))
          (print "Computed ")
          (print (length forces))
          (print " forces in ")
          (print (/ (- stop start) 1000000))
          (println "ms"))
        (return 0))
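
    For comparison, here is a rough Python sketch of the O(N²) force computation the nbody kernels express (all names here are mine, and this runs serially on the CPU rather than as GPU kernels):

```python
import math

# Bodies at (i, i, i), like make-points in the Harlan program.
def make_points(n):
    return [(float(i), float(i), float(i)) for i in range(n)]

# For each body i, sum the inverse-square contributions from every
# other body j, skipping j = i (where the distance is zero).
def nbody(bodies):
    forces = []
    for (ax, ay, az) in bodies:
        fx = fy = fz = 0.0
        for (bx, by, bz) in bodies:
            dx, dy, dz = ax - bx, ay - by, az - bz
            d = math.sqrt(dx * dx + dy * dy + dz * dz)
            if d > 0:
                fx += dx / (d * d * d)
                fy += dy / (d * d * d)
                fz += dz / (d * d * d)
        forces.append((fx, fy, fz))
    return forces

forces = nbody(make_points(4))
print(len(forces))  # 4
```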

    The language is still research quality, but I have worked hard to keep it fairly usable. I’d encourage you to try it out and let me know how it works for you!

    This is work that I’ve done with my advisor, Andrew Lumsdaine of CREST, with significant contributions from Claire Alvis, Will Byrd, Ryan Newton, and Arun Chauhan.

  • What is Macro Hygiene?

    One important, though surprisingly uncommon, feature of macro systems is that of hygiene. I mentioned in a previous post that I would eventually say something about hygiene. It turns out macro hygiene is somewhat tricky to define precisely, and I know a couple of people who are actively working on a formal definition of hygiene. The intuition behind hygiene isn’t too bad though. Basically, we want our macros to not break our code. So how can macros break code?

  • Some Simple GPU Optimizations

    One of the goals of designing a high level GPU programming language is to allow the compiler to perform optimizations on your code. One optimization we’ve been doing for a while in Harlan is one I’ve been calling “kernel fusion.” This is a pretty obvious transformation to do, and many other GPU languages do it. However, kernel fusion comes in several different variants that I’d like to discuss.