Elm is fast.
This is not because of any innovation in the compiler. In fact, the Elm compiler hardly does any optimization at all.
Elm is fast because it compiles to Javascript, which dedicated and talented engineers from all over the globe have been optimizing for more than a decade.
But here's a question: does the Elm compiler output Javascript which makes it easy for browsers to optimize it and, if not, is there any performance to gain by changing how the compiler outputs Javascript?
Let's have a look.
Hidden classes
Javascript is a dynamic language. Javascript allows objects to change in shape at any time. Variables and data structures can contain different values of different types at any time. In practice, however, most programs are fairly static and browsers try to take advantage of this.
In Chrome, every object literal and class is seen as a shape. If the shape of an object or class changes, like a property being dynamicly added or removed, Chrome will see the resultant new shape and try to convert between them. In some cases, Chrome will just treat objects as hashmaps (kinda like Elm's Dict
) if it has trouble figuring out the precise shape of something. We call these shapes for hidden classes.
We get the best performance when our Javascript code doesn't seem to deal with many different shapes at a time. Fortunately, Elm is a static language so this should be pretty common. There is, however, a case where Elm does produce different shapes that can pass as the same type. Let's look at an example:
type Maybe a
= Just a
| Nothing
This is how Elm's Maybe
is defined. It is compiled into the following Javascript (using --optimized
):
var elm$core$Maybe$Just = function (a) {
return {$: 0, a: a};
};
var elm$core$Maybe$Nothing = {$: 1};
As you can see, the Javascript object literal for Just
and Nothing
has different shapes. As a result, every Javascript code which deals with Maybe
has to be able to deal with two different shapes. But is that costly?
To measure the effect I've done two things:
1) Make a benchmark which can hopefully pick up the performance difference, then make two versions where one is handcoded to avoid the overhead.
2) Look at the assembly code which Node outputs (Node and Chrome uses the same JS engine, V8).
You can find the code for this experiment at github.
I've focused on Elm' built-in List
type, as it's used in pretty much every program. A performance improvement here would be a big benefit to everyone who uses Elm.
We'll be focusing on the following benchmark:
benchmark "int +" <|
\_ -> foldl (+) 0 intList
Simply a left fold over all the elements, adding them together. A change in performance here will tell us how quickly we can iterate through a List
, the theory being that removing any overhead dealing with multiple hidden classes should increase performance.
We compile the benchmark. Looking through the compiled JS output, we can find this piece of code:
var _List_Nil = { $: 0 };
function _List_Cons(hd, tl) { return { $: 1, a: hd, b: tl }; }
The empty List
looks different from a List
element (if you're wondering how List
s work, I explained it decently well at Elm Europe 2017).
We copy the JS file, and make the following modificaiton:
var _List_Nil = { $: 0, a: null, b: null };
function _List_Cons(hd, tl) { return { $: 1, a: hd, b: tl }; }
A List
should no longer be polymorphic (of many shapes).
The result:
- Firefox, before modification: 75,843 ops/sec
- Firefox, after modification: 84,549 ops/sec
- Safari, before modification: 248,531 ops/sec
- Safari, after modification: 248,530 ops/sec
- Chrome, before modification: 294,434 ops/sec
- Chrome, after modification: 302,569 ops/sec
So, no difference in Safari, however both Chrome and Firefox see a pretty decent improvement: ~11% in Firefox, ~4% in Chrome. Keep in mind that this is something that can be implemented in the compiler, no Elm code would have to change, it's a performance improvement for no effort on the part of application developers.
We can also look at the code that V8 generates by running the following script:
node --print-opt-code --code-comments index.js > jit_log
By reading the jit_log
for the benchmark run without modifications, we can see:
-------- Optimized code ---
optimization_id = 1
source_position = 48049
kind = OPTIMIZED_FUNCTION
name = $List$foldl
stack_slots = 10
compiler = turbofan
address = 0x28bd2bd6e9a1
Body (size = 2292)
Instructions (size = 2012)
<Assembly code>
While for the modified code we see
-------- Optimized code ---
optimization_id = 0
source_position = 48067
kind = OPTIMIZED_FUNCTION
name = $List$foldl
stack_slots = 10
compiler = turbofan
address = 0x2081135eec01
Body (size = 1848)
Instructions (size = 1600)
<Assembly code>
As expected, it generated less code when the code doesn't have to deal with polymorphism.
There is, however, something in both of this logs which give me pause:
Inlined functions (count = 1)
0x3f2705632551 <SharedFunctionInfo A2>
This section of the log lists how many functions have been inlined. In this case, only a function which evaluates the passed in function has been inlined (I'll explain what this mean in a second). This isn't actually that surprising. foldl
only contains a single function call, which is done once per loop. This function is usually never the same either, so it makes sense that it wasn't inlined. However, if you look at all the other functions that have been optimized, the only functions which are inlined are either functions which take a single argument (also called arity-1 functions) or functions called A2, A3, A4 etc.
What gives?
Inlining
Inlining function calls (replacing the function call with the implementation of the function) is one of the most important optimizations a compiler can make. This is not necessarily because function calls are all that expensive, but because it allows the compiler to better understand what the code does and perform optimizations based on that.
Let's look at our benchmark again:
benchmark "int +" <|
\_ -> foldl (+) 0 intList
Without inlining, this would call the foldl
function, which is a loop calling a function for every element in the list. Since foldl
can accept any function, it would store the intermediary number value as a reference (even though it could be stored as a number on the stack), performing a lookup in memory each time the function is called. If we weren't storing ints as intermediary values, as it would be the case if we were folding over other things, the Javscript optimizer would likely treat all values as a generic hashmap inside of foldl
.
With inlining, however, this function is likely to be compiled down to a single javascript loop, without any function calls at all, and with specialized code for the types actually used in the loop. This is like getting the benefits of a monomorphising compiler, without actually having a monomorphishing compiler.
But why aren't many functions being inlined, and what are all these A2 things?
Currying
Elm has this concept of currying. Given the following function:
add : Int -> Int -> Int
add a b =
a + b
You can create a new function like this:
add2 : Int -> Int
add2 =
add 2
So if you call a function with enough arguments, it will execute. If you call a function without all the arguments it requires, it returns a new function which accepts the missing arguments.
This is how the above add
function is compiled to JS:
function F(arity, fun, wrapper) {
wrapper.a = arity;
wrapper.f = fun;
return wrapper;
}
function F2(fun) {
return F(2, fun, function(a) { return function(b) { return fun(a,b); }; })
var author$project$Main$add = F2(function(a, b) {
return a + b;
});
Our add
function, as well as every other Elm function, is wrapped in an object which contains the original function, the arity that function expects, and a curried function. Calling the function has to be done with A2
, which is implemented like this:
function A2(fun, a, b) {
return fun.a === 2 ? fun.f(a, b) : fun(a)(b);
}
So A2 takes a F2 object, and calls the function directly if the provided function actually takes two arguments, or does a curried call if not.
From the perspective of a javascript engine, this has a big problem: unless we do whole program analysis (which is too expensive) there's no way to know if the function itself should be called, or if it is to be called using currying. The A2 call itself can be inlined, but nothing more.
But what if we made the Elm compiler smarter? If the Elm compiler knew how many arguments a function required, it could re-write this:
A2(author$project$Main$add, 1, 2)
into
author$project$Main$add.f(1, 2)
We make a copy of our benchmark, and make these changes by hand for every function call in our benchmark, and in the functions called by the benchmark.
This time we're going to focus on the results for the following function:
benchmark "* 2" <|
\_ -> map (\a -> a * 2) intList
The result:
- Firefox, before modification: 24,291 ops/sec
- Firefox, after modification: 50,927 ops/sec
- Safari, before modification: 35,723 ops/sec
- Safari, after modification: 49,029 ops/sec
- Chrome, before modification: 39,253 ops/sec
- Chrome, after modification: 58,491 ops/sec
Pretty good. The performance in Firefox is doubled, while we're seeing ~30% improvements in Chrome and Safari.
When looking at the inlining results of the unmodified code:
Inlined functions (count = 1)
0x13f84c332341 <SharedFunctionInfo A2>
We can see there's been some changes after the modifications:
Inlined functions (count = 5)
0x1f31bec396e1 <SharedFunctionInfo $map>
0x1f31bec395a9 <SharedFunctionInfo $foldr>
0x1f31bec39541 <SharedFunctionInfo $foldrHelper>
0x1f31bec32049 <SharedFunctionInfo F2>
0x1f31bec31fe1 <SharedFunctionInfo F>
However, looking over the generated assembly code I'm seeing a lot of lines containing the following:
call 0x1e5fad48abe0 (Call_ReceiverIsNotNullOrUndefined)
When we're calling functions using someObject.f(args)
, Chrome has to make sure that someObject
isn't null
or undefined
.
I've run one more benchmark. This time I've placed functions outside of F
wrappers, and call them directly.
The result:
- Firefox, before modification: 50,927 ops/sec
- Firefox, after modification: 59,632 ops/sec
- Safari, before modification: 49,029 ops/sec
- Safari, after modification: 43,695 ops/sec
- Chrome, before modification: 58,491 ops/sec
- Chrome, after modification: 63,619 ops/sec
Chrome and Firefox see some nice speedups, ~16% for Firefox and ~8% for Chrome. Safari actually sees a slowdown but I don't know why. Re-running the benchmarks several times gives wildly different results, and I don't know what to make of that.
Conclusion
Elm is fast, but there's still room to become faster. By making changes to how the Elm compiler outputs Javascript, we can increase the performance of Elm programs by a significant margin.
Further reading
If you want to know more on how Chrome makes Javascript fast, these are the two best resources I came across:
Top comments (16)
Awesome changes! Are these changes going to be implemented soon?
The honest answer to that is: don’t know 😅
Corollary, are you helping Evan directly now?
I'm a member of the core team, so I do talk with Evan from time to time. But I still write PRs which he may, or may not, merge.
I just thought it was a core team of one, so that's cool 😁
There's actually a whole bunch: elm-lang.org/community
:)
Wonderfully informative post.
Interesting article! Can I translate it into Japanese?
Sure :)
I'm sorry there is something that I can't understand in
Inlining
section.the intermediary number value
is accumlator?we were folding over other things
mean actually? And why aren't ints stored? In my knowledge, all local variables are stored on the stack or the heap.I'm looking forward to your answer
Thanks for your quick response! I understood.
I'm so sorry I have one more question, but this is last one.
It's about this sentence.
What I understand is below:
This talks about Inline Caching. If
foldl
is inlined, the function as argument offoldl
is also inlined, so without Inline Caching code of inlinedfoldl
's loop is specilized for argument functions type.Is this right?
This talks about Inline Caching
=> inline caches is something else, inlining is just replacing a function call with the implementation of the function.without Inline Caching code of inlined foldl's loop is specilized for argument functions type
=> without inlining, the foldl's loop cannot be specialized because it has to handle all possible argument types.Thank you! You mean
monomophishing
is to specialize code by inlining function call, right? If so, I can't understand whatwithout actually having a monomorphishing compiler
means...monomorphising doesn't do inling. When a monomorphising compiler sees the following code:
it will generate two versions of
listmap
, one which has the type(Int -> String) -> List Int -> List String
and one which has the type(Int -> Bool) -> List Int -> List Bool
. It will then use the specializedlistmap
implementations where it can.This allows the compiler to specialize code better, because it has precise type information.
Oh...I misread that you used "monomorphishing" like as this article explained.
I've published my translation! zenn.dev/mshaka/articles/633027bef...
I appreciate your patience and kindness!