Source: https://github.com/aleksei-berezkin/mandelbrot
The Mandelbrot set is one of the best-known mathematical fractals. With very simple equations, it plots convoluted and beautiful curves, which is why visualizing it is a popular programming task. However, don't be misled: the seemingly trivial formulas hide a wealth of complexity 😉 This is a story of striving for performance and precision.
The math
The fractal is plotted on the complex plane. Each pixel's color depends on if (and when) the sequence

z_{n+1} = z_n^2 + c,   z_0 = 0

diverges, where c = x + i*y, and x and y are the pixel's coordinates on the screen. We may rewrite this in more familiar real numbers by substituting z_n = x_n + i*y_n:

x_{n+1} = x_n^2 - y_n^2 + x
y_{n+1} = 2 * x_n * y_n + y

The sequence is said to diverge when the following condition is met:

x_n^2 + y_n^2 > 4

because subsequent items will then inevitably grow to infinity. The pixel color is coded in the following way:
- If the sequence never diverges, the color is black
- If the sequence diverges at iteration n, the color is some function of n
So, let's write some JavaScript?
It's quite easy to write several dozen simple JS lines to get it working. But if you try, you quickly discover where the real problems are:
- When zooming into the set, the required number of iterations rapidly grows, causing your animation to first lose smoothness and then freeze
- At a certain level of zoom, the JS `number` (64-bit floating point) becomes too imprecise, and the picture becomes pixelated
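For reference, the naive single-pixel loop behind those "several dozen simple JS lines" can be sketched like this (plain TypeScript, illustrative only, names are mine):

```typescript
// Returns the iteration at which the point c = (x, y) diverges,
// or -1 if it stays bounded for maxIterations (i.e. a black pixel)
function divergedAt(x: number, y: number, maxIterations: number): number {
  let xn = 0, yn = 0;
  for (let i = 0; i < maxIterations; i++) {
    const xSqr = xn * xn, ySqr = yn * yn;
    if (xSqr + ySqr > 4) return i;  // diverged
    const xNext = xSqr - ySqr + x;  // x_{n+1} = x_n^2 - y_n^2 + x
    yn = 2 * xn * yn + y;           // y_{n+1} = 2 * x_n * y_n + y
    xn = xNext;
  }
  return -1;
}
```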
How to approach these? Let's see!
1. Parallel workers
These days, even inexpensive smartphones have several CPU cores, and using them for computation-heavy tasks is a must. On the web, you can take advantage of multiple CPU cores by running parallel workers, and the `navigator.hardwareConcurrency` API suggests the optimal number of workers to use.
In my case, I divide the whole picture into approximately 8 * workersNumber rectangles and shuffle them for better load balancing. This picture shows different workers' tasks in different hues:
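The splitting itself is straightforward. Here's a sketch of the idea under assumed names (`makeTasks` and `Rect` are mine, not the repo's): pick a grid of roughly 8 * workersNumber cells, then shuffle it so that neighboring, similarly expensive tiles land on different workers:

```typescript
type Rect = { x: number; y: number; w: number; h: number };

// Split a w × h canvas into roughly 8 * workersNumber rectangles
function makeTasks(w: number, h: number, workersNumber: number): Rect[] {
  const target = 8 * workersNumber;
  // Pick grid dimensions so tiles are roughly square
  const cols = Math.ceil(Math.sqrt(target * w / h));
  const rows = Math.ceil(target / cols);
  const tasks: Rect[] = [];
  for (let r = 0; r < rows; r++)
    for (let c = 0; c < cols; c++) {
      const x0 = Math.floor(c * w / cols), x1 = Math.floor((c + 1) * w / cols);
      const y0 = Math.floor(r * h / rows), y1 = Math.floor((r + 1) * h / rows);
      tasks.push({ x: x0, y: y0, w: x1 - x0, h: y1 - y0 });
    }
  // Fisher–Yates shuffle for load balancing
  for (let i = tasks.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [tasks[i], tasks[j]] = [tasks[j], tasks[i]];
  }
  return tasks;
}
```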
2. From JS to WebAssembly
WebAssembly (WASM) is a binary instruction format that browsers can execute very efficiently. You can write WASM programs in an assembly-like language or compile to WASM from a high-level language. I chose AssemblyScript, which is very similar to JS and TS and is well documented.
As a low-level format, raw WASM lacks convenient data structures such as arrays and even strings, and requires manual memory management, which can be quite complex. WASM compilers can emulate much of this for you, but of course that comes at a cost: each additional feature consumes valuable CPU cycles. That's why, if you strive for performance, it's better not to use these features, and that's exactly what I did. Here's a function which reads the rendered value of a pixel directly from memory:
function loadRendered(xy: u64): u32 {
  // x and y are packed into a single u64: y in the high 32 bits, x in the low
  const y = (xy >>> 32) as u32;
  const x = xy as u32;
  // Each pixel is a u32, hence the factor of 4 in the byte address
  return load<u32>(4 * (y * canvasW + x));
}
Is it TS? Or maybe C? 🙂
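On the JS side, the same pixel buffer can be read without copying, through a typed-array view over the module's linear memory. A sketch with assumed names (`memory` and `canvasW` would come from the instantiated module; `loadRenderedJs` is mine):

```typescript
// Read one rendered pixel from the module's linear memory.
// Uint32Array indexing already scales by 4 bytes, matching
// the 4 * (...) byte address on the WASM side.
function loadRenderedJs(memory: WebAssembly.Memory, canvasW: number,
                        x: number, y: number): number {
  const pixels = new Uint32Array(memory.buffer);
  return pixels[y * canvasW + x];
}
```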
3. From number to arbitrary-precision arithmetic
Here's the image at 10^14 zoom when using only 64-bit floating-point numbers:
Looks like a UFO trace from an early-'90s game 🙂
To address this problem, we need to extend the width of a number, which is known as arbitrary-precision arithmetic. Unfortunately, there's no such thing in the WASM standard library, so I implemented my own version, heavily tailored to the problem.
Very generally, a number consists of multiple 32-bit “digits”:
             | integer | fractional part
             | part    |
-------------+---------+----------------------------
32-bit items | digit0  | digit1   digit2   digit3 ...
bits         | 0-31    | 32-63    64-95    96-127
In fact, that's similar to how we write fractional numbers, except the “digits” are not 10-based but 2^32-based. Or 2-based, if we think of every bit as a “digit”; the decimal values of such 2-based “digits” are: 2^31, 2^30, ..., 2, 1, (decimal separator), 1/2, 1/4, ...
Here's an example:
decimal = 10.5
hex = a.8
binary = 1010.1
And another:
hex = c.8000_0000_8 (width = 3)
decimal = 12
+ 1/2 (bit 32)
+ 1/2^33 (bit 64)
≈ 12.5000000001
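To make the layout concrete, here's a small helper (hypothetical, not from the repo) that evaluates such a digit array; plain `number` loses precision beyond ~53 bits, so this is for illustration only:

```typescript
// Digit k contributes digits[k] / 2^(32 * k): digit0 is the integer part,
// digit1 the first 32 fractional bits, and so on
function valueOf(digits: number[]): number {
  let value = 0;
  for (let k = 0; k < digits.length; k++) {
    value += digits[k] / 2 ** (32 * k);
  }
  return value;
}
```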
Arithmetic operations
Operations work the same way we were taught in primary school. For example, the addition:
a0 a1 a2 ...
+ b0 b1 b2 ...
============
c0 c1 c2 ...
Note: the overflow (the carry-out from c0) is discarded because it's not needed for the task.
How to write the code for this with arbitrary width? Well, there are two ways:
Operands in memory: use loops
Each number is stored in consecutive memory cells, and we may use indexed loops to access digits:
function add(a: u32, b: u32, c: u32, width: u32): void {
  let carryOver: u64 = 0;
  // a, b, c are byte addresses; each 32-bit digit occupies 4 bytes,
  // hence the byte offset 4 * i
  for (let i: i32 = width - 1; i >= 0; i--) {
    const x: u64 = (load<u32>(a + 4 * i) as u64)
      + (load<u32>(b + 4 * i) as u64)
      + carryOver;
    carryOver = x >>> 32;
    store<u32>(c + 4 * i, x as u32);
  }
}
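Ported to plain TypeScript (BigInt standing in for u64, a Uint32Array for linear memory; `addDigits` is my name), the same loop can be tested in isolation:

```typescript
// Add two big numbers digit by digit, most significant digit at index 0.
// The carry out of digit0 is discarded, as in the WASM version.
function addDigits(a: Uint32Array, b: Uint32Array, c: Uint32Array, width: number): void {
  let carryOver = 0n;
  for (let i = width - 1; i >= 0; i--) {
    const x = BigInt(a[i]) + BigInt(b[i]) + carryOver;
    carryOver = x >> 32n;           // carry into the next (more significant) digit
    c[i] = Number(x & 0xffffffffn); // keep the low 32 bits
  }
}
```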
Operands are local vars: unroll loops
const width = 3

let a0: u64, a1: u64, a2: u64,
    b0: u64, b1: u64, b2: u64,
    c0: u64, c1: u64, c2: u64,
    carryOver: u64;

// ...

carryOver = 0;
c2 = a2 + b2;
carryOver = c2 >>> 32;
c2 &= 0xffffffff;

c1 = a1 + b1 + carryOver;
carryOver = c1 >>> 32;
c1 &= 0xffffffff;

c0 = a0 + b0 + carryOver;
c0 &= 0xffffffff;
Note: although digits are 32-bit, variables are 64-bit to retain carry-over.
Which is faster?
The approach with unrolled loops is faster (15–20% in my case). However, it requires separate code for each width. Happily, when the code is almost the same, you can easily generate it. I generate computations for widths up to 12. I doubt anyone zooms deeper, but if you do, please file an issue 😉
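The generation step can be as simple as string templating. A sketch of the idea (not the repo's actual generator; BigInt stands in for u64, and digits live in plain arrays here):

```typescript
// Emit the unrolled addition body for a given width as JS source
function genAddBody(width: number): string {
  const lines: string[] = ['let carry = 0n;'];
  for (let i = width - 1; i >= 0; i--) {
    lines.push(`c[${i}] = a[${i}] + b[${i}] + carry;`);
    lines.push(`carry = c[${i}] >> 32n;`);
    lines.push(`c[${i}] &= 0xffffffffn;`); // masking digit0 discards the overflow
  }
  return lines.join('\n');
}

// Compile the generated body once, then call it like a regular function
const add2 = new Function('a', 'b', 'c', genAddBody(2)) as
  (a: bigint[], b: bigint[], c: bigint[]) => void;
```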
Now, the fixed image. Indeed a UFO, but much nicer 🙂
4. WASM vectorization
“Single instruction, multiple data” (SIMD) or “vectorization” is simply when your CPU executes the same instructions against multiple variables at once. In WASM you work with 128-bit vectors that contain 2 to 16 data items (“lanes”) depending on their size. For example, if your data items are 8-bit integers, one vector contains 16 lanes. In my case, data items are 64-bit (floating-point or integer), so I'm able to compute series for 2 pixels at once.
Here's an example. First, the non-vectorized code for 64-bit floats:
for (let i: u32 = 0; i < maxIterations; i++) {
  const xSqr: f64 = x * x;
  const ySqr: f64 = y * y;
  if (xSqr + ySqr > 4) {
    // Diverged
    break;
  }
  // Compute next (x, y) ...
}
Now the vectorized version:
for (let i: u32 = 0; i < maxIterations; i++) {
  const xSqr = v128.mul<f64>(x, x);
  const ySqr = v128.mul<f64>(y, y);
  // FOUR is an f64x2 vector with 4 in both lanes; gt produces an
  // all-ones mask in each lane where the comparison holds
  const g4 = v128.gt<f64>(v128.add<f64>(xSqr, ySqr), FOUR);
  if (!point0DivergedAt && v128.extract_lane<u64>(g4, 0)) {
    point0DivergedAt = i;
  }
  if (!point1DivergedAt && v128.extract_lane<u64>(g4, 1)) {
    point1DivergedAt = i;
  }
  if (point0DivergedAt && point1DivergedAt) {
    // Both diverged
    break;
  }
  // Compute next (x, y) ...
}
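For clarity, here's what the two-lane loop computes, emulated in scalar TypeScript (names are mine; 0 means "did not diverge within maxIterations"):

```typescript
// Iterate two points c = (cx[p], cy[p]) in lockstep and record the
// iteration at which each diverges (0 = stayed bounded)
function iterate2(cx: number[], cy: number[], maxIterations: number): [number, number] {
  const x = [0, 0], y = [0, 0];
  const divergedAt: [number, number] = [0, 0];
  for (let i = 1; i <= maxIterations; i++) {
    for (const p of [0, 1]) {
      if (divergedAt[p]) continue;  // this lane is already done
      const xSqr = x[p] * x[p], ySqr = y[p] * y[p];
      if (xSqr + ySqr > 4) { divergedAt[p] = i; continue; }
      const xNext = xSqr - ySqr + cx[p];
      y[p] = 2 * x[p] * y[p] + cy[p];
      x[p] = xNext;
    }
    if (divergedAt[0] && divergedAt[1]) break;  // both diverged
  }
  return divergedAt;
}
```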
How much faster did it get?
Floating-point arithmetic became exactly 2x faster. The big-number version got only 15–20% faster, and honestly, I don't know why. Perhaps my code has a problem that I cannot spot; I hope to find it someday.
5. Optimization: filling same-color rectangles
Note how many same-color areas there are in the following example:
Computing an almost identical series for each pixel seems inefficient. Is it possible to optimize this? Happily, yes. With this optimization, rendering is coded as a recursive process of painting nested rectangles. The first rectangle is the whole render task given to the worker. Then, the recursive algorithm goes like this:
- Render only the contour of the current rectangle plus its midpoint; or, for small rectangles, only the corners plus the midpoint
- If all rendered pixels are of the same color, fill the entire rectangle with that color
- Otherwise:
  - If the current rectangle is still not too small, split it into two and recursively render each half
  - Otherwise, render all pixels of the current rectangle as usual
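A sketch of this recursion in TypeScript (simplified: it always renders the full contour, omitting the corners-only case for small rectangles; `colorAt` is a stand-in for the per-pixel series computation, and all names are mine):

```typescript
type Color = number;

// Recursively render the rectangle [x0, x1) × [y0, y1) into `out`
// (row-major, canvas width `w`), filling uniform areas without
// computing their interior pixels
function renderRect(
  out: Color[], w: number,
  x0: number, y0: number, x1: number, y1: number,
  colorAt: (x: number, y: number) => Color,
  minSize = 3,
): void {
  if (x1 <= x0 || y1 <= y0) return;
  // 1. Render the contour and the midpoint, collecting the colors seen
  const seen = new Set<Color>();
  for (let x = x0; x < x1; x++)
    for (const y of [y0, y1 - 1]) {
      const c = colorAt(x, y);
      out[y * w + x] = c; seen.add(c);
    }
  for (let y = y0; y < y1; y++)
    for (const x of [x0, x1 - 1]) {
      const c = colorAt(x, y);
      out[y * w + x] = c; seen.add(c);
    }
  const mx = (x0 + x1) >> 1, my = (y0 + y1) >> 1;
  const mid = colorAt(mx, my);
  out[my * w + mx] = mid; seen.add(mid);

  if (seen.size === 1) {
    // 2. Uniform contour: fill the interior without computing it
    for (let y = y0 + 1; y < y1 - 1; y++)
      for (let x = x0 + 1; x < x1 - 1; x++)
        out[y * w + x] = mid;
  } else if (x1 - x0 > minSize || y1 - y0 > minSize) {
    // 3. Split along the longer side and recurse
    if (x1 - x0 >= y1 - y0) {
      renderRect(out, w, x0, y0, mx, y1, colorAt, minSize);
      renderRect(out, w, mx, y0, x1, y1, colorAt, minSize);
    } else {
      renderRect(out, w, x0, y0, x1, my, colorAt, minSize);
      renderRect(out, w, x0, my, x1, y1, colorAt, minSize);
    }
  } else {
    // 4. Too small to split: render every pixel as usual
    for (let y = y0; y < y1; y++)
      for (let x = x0; x < x1; x++)
        out[y * w + x] = colorAt(x, y);
  }
}
```

Note that this is a heuristic: a rectangle whose contour is uniform may still contain a differently colored island, which is why real renderers (this one included) keep the rectangles reasonably small.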
Here's the same image with fast-filled rectangles outlined:
This enormously boosts performance, especially for “stable” parts of the image, i.e., parts that do not reveal too much color variability.
Conclusion: is it finally fast enough?
The issue with the Mandelbrot set is that, as the iteration count and the big-number width keep growing with zoom, lags cannot be eliminated, only postponed. Optimizations let you explore the set more deeply, but inevitably there comes a point where the required amount of computation exceeds any CPU's power. This means there will always be room for optimization, but at present, I seem to have exhausted all available options. Or am I wrong? Can you suggest anything to improve? And thank you for reading!