The “Sufficiently Clever Compiler” has become something of a trope in the Lisp community: the mythical beast that promises language and interface designers near-unlimited freedom, and leaves their output in a performance lurch by its non-appearance. A few years ago, I was young enough to join a research project to build one of these things. Neglecting a raft of asterisks, footnotes, and caveats, we ended up making something whose essence is pretty impressive: you pay for abstraction boundaries in compile-time resources, but they end up free at runtime. One prototype was just open-sourced recently, so that makes now a good time to talk about it.
Permit me to begin my musings on this subject with an example. The next element on this page is a Javascript Canvas object, whereon, upon your clicking the button labeled “Go!”, your browser will draw a progressively-refined rendition of the Mandelbrot set, anti-aliased for your viewing pleasure. It will also display how long each iteration took to compute, both in real time and in count of floating point arithmetic operations.1
| Resolution | Mflops | Time (ms) |
|---|---|---|
| *(filled in by the demo as it runs)* | | |
I obviously don’t know how fast this demo runs on your computer, but on mine, in Firefox 23.0, with asm.js turned on, I get a little under 1 Mflop/ms. Given that my CPU is 1.6 GHz, that’s an average of a little over half a floating-point operation per clock cycle. Not bad for a single-threaded Javascript program, eh?2
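For concreteness, the arithmetic behind that estimate (using the nominal 1.6 GHz figure; your machine's numbers will of course differ): \[ 1 \frac{\textrm{Mflop}}{\textrm{ms}} = 10^9 \frac{\textrm{flops}}{\textrm{s}}, \qquad \frac{10^9\ \textrm{flops/s}}{1.6 \cdot 10^9\ \textrm{cycles/s}} \approx 0.63 \frac{\textrm{flops}}{\textrm{cycle}}. \]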
What do you think the inner loop of that program looks like? You could be forgiven for thinking that it looks something like this:
function fol_program(stdlib, foreign, heap) {
  "use asm";
  var heap_view = new stdlib.Float32Array(heap);
  var sqrt = stdlib.Math.sqrt;
  function op_231(count, c_x, c_y, z_x, z_y) {
    count = +count; c_x = +c_x; c_y = +c_y;
    z_x = +z_x; z_y = +z_y;
    if (count <= 0.0) {
      heap_view[0] = z_x;
      heap_view[1] = z_y;
      return;
    } else {
      op_231(count - 1.0, c_x, c_y,
             ((z_x*z_x - z_y*z_y) + c_x),
             ((z_x*z_y + z_y*z_x) + c_y));
      return;
    }
  }
  function __main__(x, y) {
    x = +x; y = +y;
    var z_x = 0.0; var z_y = 0.0;
    op_231(400.0, x, y, 0.0, 0.0);
    z_x = +heap_view[0];
    z_y = +heap_view[1];
    return (+sqrt(z_x*z_x + z_y*z_y) < 2.0)|0;
  }
  return __main__;
}
If you thought that, you would not be entirely wrong, because that is, in fact, the Mandelbrot set membership test code that your browser actually executes around a million times when you click on the “Go!” button.
However, the way I want to actually write that program looks like this:
;;; Complex arithmetic library
(define (c:+ z1 z2)
  (cons (+ (car z1) (car z2))
        (+ (cdr z1) (cdr z2))))
(define (c:* z1 z2)
  (cons (- (* (car z1) (car z2))
           (* (cdr z1) (cdr z2)))
        (+ (* (car z1) (cdr z2))
           (* (cdr z1) (car z2)))))
(define c:0 (cons (real 0) (real 0)))
(define (magnitude z)
  (sqrt (+ (* (car z) (car z))
           (* (cdr z) (cdr z)))))

;;; Iteration library
(define (iterate count f x)
  (if (<= count 0)
      x
      (iterate (- count 1) f (f x))))

;;; Mandelbrot set membership test
(define ((step c) z)
  (c:+ (c:* z z) c))
(define (mandelbrot? c)
  (< (magnitude
      (iterate (real 400) (step c) c:0))
     2))
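Just to make the interface concrete, here is a hypothetical use of mandelbrot? (the sample points are mine, chosen so the expected answers are easy to check by hand; they are wrapped in real like the other run-time quantities above):
;; Hypothetical sample points, for illustration only
(mandelbrot? (cons (real 0) (real 0))) ; => #t: the orbit stays at 0
(mandelbrot? (cons (real 1) (real 0))) ; => #f: the orbit 0, 1, 2, 5, 26, ... escapes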
Doubtless the first difference that jumps out at you is that one of these programs is written using a beautiful, familiar surface syntax that makes you feel right at home, while the other uses an ugly, foreign notation that makes your eyes glaze over. That’s not the point, though.
The important difference is that the program the browser runs (in Javascript) is much less modular than the program I want to write (in Scheme, as it happens). In the Javascript program, all the concepts relating to Mandelbrot set membership testing are mushed together: the concepts of complex arithmetic are mixed up with the concept of iterating a function, and mixed up with the definition of the function to iterate. None of them are available as named entities; if you wanted to solve the problem a different way, you would have to rewrite the entire thing from scratch.
In the Scheme program, however, they are separate definitions, given their own separate names. Complex arithmetic is defined in one place, independent of how you choose to use it. The step function defining the Mandelbrot recurrence is defined in another place, available to be used any which way you want. The idea of iterating a function for a while is defined in another place, independent of the function to iterate. A change of algorithm becomes a matter of rearranging the available pieces, rather than of rewriting the whole program.
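To make that concrete, here is a rearrangement of my own (an illustrative sketch, not part of the demo): a Julia set membership test built from the very same step, iterate, and complex arithmetic, with only the roles of the fixed parameter and the starting point swapped.
;;; Julia set membership test (illustrative sketch reusing the same pieces)
(define ((julia? c) z0)
  (< (magnitude
      (iterate (real 400) (step c) z0))
     2))
Nothing above it had to change; the existing pieces were just reassembled.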
And there we have a clue as to why, in modern computing, modular code is not high-performance code. The very advantage of modularity is the source of its performance cost: modularity is the ability to do things other than what you did. Modularity is the ability to reuse the same idea in many different ways and in many different places.
The cost that modularity carries is generic linkages between the “idea” and the “places”. Those linkages carry their own cost at runtime, but, more importantly, they impede other optimizations. In our example, iterate accepts an arbitrary function f. That means the call (f x) in iterate has to be set up to call an arbitrary function. It also means that its argument and its return value have to be communicated using completely generic mechanisms.
Symmetrically, the function produced by (step c), being a first-class object in Scheme, can be used anywhere. That means it has to accept its argument and produce its return value using completely generic mechanisms that match the generic call sites. It also means that we cannot optimize it by changing its interface, because we do not know the places where the complementary changes would need to be made. So (step c) must accept its complex number argument and return its complex number result each as a single value, presumably machine-word-sized; so those complex numbers have to be allocated and dereferenced; and the cost of making those data structures all the time dwarfs the cost of the arithmetic we are trying to do.
“But wait!” I can hear you say to yourself, “I know what function iterate is iterating, and I know where (step c) is called. Why is my compiler so dumb as not to notice that mandelbrot? calls for an iteration of exactly (step c), and produce code for it that takes this into account? Then it would be clear that the complex number produced by one call to (step c) will just be consumed by the next, and the program can avoid allocating a silly two-element structure for it by just passing the two parts in variables.” (Which is, of course, how the Javascript program behaves.)
That’s right. The thing you did there was a little bit of (combined data- and control-) flow analysis, computing that iterate gets called with (step c) as an argument and therefore (step c) appears as the f in the call (f x) inside iterate. You also decided that, at least for this problem, it is probably worthwhile to produce a version of iterate that is specialized to (step c) as an argument (with due provision for operating on different values of c). And that’s my point. I conjecture that, at this moment in history, the one remaining piece of compiler technology that human programmers instinctively expect before admitting that a compiler is Sufficiently Clever is the knack for automatically specializing the right program elements to the context of their use.
Specialization is the path from modularity to performance.3 A modular component is one that can be used in different ways in different places. A specialized component is one that is tailored to the way and place it is used, taking advantage of local context to improve efficiency; often to the point that separation between component and context is completely obscured. To the extent that specialization can be automated, doing so would let us have our cake and eat it too—write high-performance programs in a modular way.
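To make the target concrete, here is a hand-written sketch (mine, not DVL's actual output) of the kind of residual program such specialization aims for on the Mandelbrot example: iterate fused with its step function, and the complex numbers unboxed into pairs of ordinary variables.
;; Hand-specialized sketch, not compiler output: the iteration and the
;; step function are fused, and each complex number is passed as two reals.
(define (mandelbrot-loop count c-x c-y z-x z-y)
  (if (<= count 0)
      (< (sqrt (+ (* z-x z-x) (* z-y z-y))) 2)
      (mandelbrot-loop (- count 1) c-x c-y
                       (+ (- (* z-x z-x) (* z-y z-y)) c-x)
                       (+ (* z-x z-y) (* z-y z-x) c-y))))
(define (fast-mandelbrot? c)
  (mandelbrot-loop 400 (car c) (cdr c) 0 0))
Modulo spelling, this is exactly the shape of op_231 in the asm.js program at the top of the post.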
The problem with specialization is that it comes with its own cost: if a component is used in many places, specializing it to each context requires copying it. Since the inside of a component is, of course, a set of places where other components are used, the number of contexts grows with each specialization, and, unchecked, this process can easily get out of hand. That’s why mainstream compilers are pretty conservative about it.
That being said, there are problem domains where really aggressive specialization should work wonders. The activity called “scientific computing” comes to mind: the main item of interest in these programs is some complicated, CPU- and memory-intensive (as opposed to I/O-intensive) computation involving floating-point numbers. All the fine data structures, first-class functions, and abstractions with which one might wish to write such programs are just scaffolding whose sole purpose is to get the right floating-point numbers to meet the right arithmetic operations. So it is tempting to try and see whether one could write scientific computing programs in a nice, high-level, modular way, and have the compiler solve the scaffolding at compile time (by specialization followed by standard optimizations) and produce efficient numerical code.
That was the theory behind the research project I alluded to in the introduction. Modern general-purpose compilers do not specialize aggressively enough to “solve the scaffolding” of any nontrivial program. In particular, automatic differentiation is an especially impressive form of scaffolding that would be very nice to be able to use, but that tends to trigger “this is too costly to specialize” heuristics. But perhaps there is a niche for tools that specialize too aggressively for general-purpose use, but aggressively enough to make (at least some) numerical kernels go really fast despite being very modular.
I wrote one of those. Under the supervision of Professor Pearlmutter, and with much assistance from our collaborator Professor Siskind, the research group studied various issues with the semantics and implementation of automatic differentiation (which are nontrivial), made forays into compiler technology, and so forth. It so transpired that I ended up being the primary author of the DysVunctional Language prototype, which specializes away modularity boundaries (including a user-space automatic differentiation library) until the cows come home.
Now, I do not advocate seriously using DVL (yet), as it is unfinished (and not very actively worked on right now, for lack of funds). For one, it is meant for numerical kernels, but it does not have a sensible foreign interface. For another, it requires programmer annotations to indicate what not to specialize on (in the complete absence of which, its flow analysis will amount to running your program slowly at compile time, and the code generator will amount to writing an executable that just prints out the answer). It is also research-prototype quality software in general—for instance, a type error in the source program will land you at a debugger prompt inside the compiler rather than a sensible error message.
I am, nevertheless, quite proud of DVL. I flatter myself that its source is not too terribly hard to read, so it can serve as an example for how to do that kind of thing in a different context. And the thing it does, if you get your program to compile, feels pretty magical: as much abstraction and modularity as you could want gets stripped away, revealing the bare numerical computation for the computer to just execute. Not just on podunk 30-line Mandelbrot set programs, either—check out the celestial mechanics example.
Now, DVL is not Sufficiently Clever for general-purpose programming, because it doesn’t know what not to specialize. That’s an open problem, but it feels to me like the right place to dig.
Notes
Here is how I count floating-point operations. The Mandelbrot set, by definition, is those complex numbers \(c\) for which iterating \[z_{n+1} = z_n^2 + c\] from \(z_0 = 0\) does not escape to infinity. It’s not too hard to check that, if \(|z_n|\) ever exceeds 2, then that iteration will escape to infinity, regardless of \(c\). This suggests an algorithm for approximating membership in the Mandelbrot set: for any \(c\) of interest, run the iteration for a fixed number of steps. If the result’s magnitude exceeds 2, \(c\) is not in the Mandelbrot set; if it does not, we can approximately say that \(c\) is in the set.
The program in the demo follows this algorithm. Every iteration takes 10 arithmetic operations (six for the complex multiply, two for the complex add, and two for bookkeeping). I chose to go for 400 iterations per point. Every pixel tests 4 points, for the anti-aliasing. That makes \[ 4 \frac{\textrm{points}}{\textrm{pixel}} \cdot 400 \frac{\textrm{iters}}{\textrm{point}} \cdot 10 \frac{\textrm{flops}}{\textrm{iter}} = 16,\!000 \frac{\textrm{flops}}{\textrm{pixel}}, \] which is where the counts in the table come from.
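For instance, a hypothetical 300-by-300-pixel stage (the actual resolutions depend on your canvas) would come to 1,440 Mflops:
;; Flop accounting for a hypothetical 300x300 rendering pass
(define flops-per-pixel (* 4 400 10)) ; 16,000
(/ (* 300 300 flops-per-pixel) 1000000) ; => 1440 Mflops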
Aside: It is surely computationally more efficient to check intermediate steps in the iteration for escaping rather than always going a fixed number of steps, but that makes counting floating point operations harder, because you do not know how many get skipped. That’s why I chose not to do that for this exercise.↩
As of this writing, the best program for computing the Mandelbrot set on the Computer Language Benchmarks Game (which happened to be written in Fortran 90 and compiled with the Intel Fortran compiler) was only six times faster than the demo you see here. I do not know how much of that speedup was due to doing less actual computation—unlike this demo, that benchmark does truncate its iterations if they escape early, but the exact savings from this are hard to measure.↩
One quip among compiler writers that expresses this sentiment is “inlining is the mother of all optimizations”. Inlining is the act of replacing a reference to some variable with that variable’s definition; putting the definition “in line”, as it were. This is particularly helpful if the variable in question refers to a function.
Inlining can cause acceleration in its own right (for example, by eliminating the runtime cost of referring to the variable that was inlined), but the reason it is considered so important is because inlining a variable that is used in more than one place accomplishes specialization. Inlining a definition also constructs a proof that the specialized definition is used in exactly the one place where it is used, but bringing the definition to the use and deleting the reference is not the only way to do that. The idea of specialization is also applicable more widely than inlining, because it is possible (and desirable) to be able to specialize to clusters of similar contexts, without further specializing to each of these contexts individually.↩
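A tiny Scheme illustration, of my own devising, of inlining-as-specialization:
;; Before: the calls to f inside twice are generic.
(define (twice f x) (f (f x)))
(define (add-3 x) (+ x 3))
(define (six-more x) (twice add-3 x))
;; After inlining twice and then add-3 at this use site, the specialized
;; definition mentions neither function, and further simplification (such
;; as folding the two additions into one) can now see everything it needs.
(define (six-more-inlined x) (+ (+ x 3) 3))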