Source: https://github.com/aleksei-berezkin/mandelbrot
The Mandelbrot set is one of the best-known mathematical fractals. With very simple equations, it plots convoluted and beautiful curves, which is why visualizing it is a popular programming task. However, don't be misled: the seemingly trivial formulas hide a wealth of complexity 😉 This is a story of striving for performance and precision.
The math
The fractal is plotted on the complex plane. Each pixel's color depends on if (and when) the sequence

z_{n+1} = z_n^2 + c,   z_0 = 0

diverges, where c = x + i*y, and x and y are the pixel's coordinates on the screen. We may rewrite this in more familiar real numbers by substituting z_n = x_n + i*y_n:

x_{n+1} = x_n^2 - y_n^2 + x
y_{n+1} = 2 * x_n * y_n + y

The sequence is said to diverge when the following condition is met:

x_n^2 + y_n^2 > 4

because subsequent items will then inevitably grow to infinity. The pixel color is coded in the following way:
- If the sequence never diverges, the color is black
- If the sequence diverges at iteration n, the color is some function of n
So, let's write some JavaScript?
It's quite easy to write several dozen simple JS lines to get it working. But if you try, you quickly discover where the real problems are:
- When zooming into the set, the required number of iterations rapidly grows, causing your animation to first lose smoothness and then freeze
- At a certain level of zoom, the JS `number` (64-bit floating point) becomes too imprecise, and the picture becomes pixelated
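For reference, the naive single-pixel loop behind those "several dozen simple JS lines" can be sketched like this (plain TypeScript, illustrative only, names are mine):

```typescript
// Returns the iteration at which the point c = (x, y) diverges,
// or -1 if it stays bounded for maxIterations (i.e. a black pixel)
function divergedAt(x: number, y: number, maxIterations: number): number {
  let xn = 0, yn = 0;
  for (let i = 0; i < maxIterations; i++) {
    const xSqr = xn * xn, ySqr = yn * yn;
    if (xSqr + ySqr > 4) return i;  // diverged
    const xNext = xSqr - ySqr + x;  // x_{n+1} = x_n^2 - y_n^2 + x
    yn = 2 * xn * yn + y;           // y_{n+1} = 2 * x_n * y_n + y
    xn = xNext;
  }
  return -1;
}
```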
How to approach these? Let's see!
1. Parallel workers
These days, even inexpensive smartphones have several CPU cores, and using them for computation-heavy tasks is a must. On the web, you can take advantage of multiple CPU cores by running parallel workers, and the `navigator.hardwareConcurrency` API suggests the optimal number of workers to use.
In my case, I divide the whole picture into approximately 8 * workersNumber rectangles and shuffle them for better load balancing. This picture shows different workers' tasks in different hues:
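The splitting itself is straightforward. Here's a sketch of the idea under assumed names (`makeTasks` and `Rect` are mine, not the repo's): pick a grid of roughly 8 * workersNumber cells, then shuffle it so that neighboring, similarly expensive tiles land on different workers:

```typescript
type Rect = { x: number; y: number; w: number; h: number };

// Split a w × h canvas into roughly 8 * workersNumber rectangles
function makeTasks(w: number, h: number, workersNumber: number): Rect[] {
  const target = 8 * workersNumber;
  // Pick grid dimensions so tiles are roughly square
  const cols = Math.ceil(Math.sqrt(target * w / h));
  const rows = Math.ceil(target / cols);
  const tasks: Rect[] = [];
  for (let r = 0; r < rows; r++)
    for (let c = 0; c < cols; c++) {
      const x0 = Math.floor(c * w / cols), x1 = Math.floor((c + 1) * w / cols);
      const y0 = Math.floor(r * h / rows), y1 = Math.floor((r + 1) * h / rows);
      tasks.push({ x: x0, y: y0, w: x1 - x0, h: y1 - y0 });
    }
  // Fisher–Yates shuffle for load balancing
  for (let i = tasks.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [tasks[i], tasks[j]] = [tasks[j], tasks[i]];
  }
  return tasks;
}
```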
2. From JS to WebAssembly
WebAssembly (WASM) is a binary instruction format that browsers can execute very efficiently. You can write WASM programs in an assembly-like language or compile to WASM from a high-level language. I chose AssemblyScript, which is very similar to JS and TS and is well documented.
As a low-level format, raw WASM lacks convenient data structures such as arrays and even strings, and requires manual memory management, which can be quite complex. WASM compilers can emulate much of this for you, but of course that comes at a cost: each additional feature consumes valuable CPU cycles. That's why, if you strive for performance, it's better not to use these features, and that's exactly what I did. Here's a function which reads the rendered value of a pixel directly from memory:
function loadRendered(xy: u64): u32 {
  // x and y are packed into a single u64: y in the high 32 bits, x in the low
  const y = (xy >>> 32) as u32;
  const x = xy as u32;
  // Each pixel is a u32, hence the factor of 4 in the byte address
  return load<u32>(4 * (y * canvasW + x));
}
Is it TS? Or maybe C? 🙂
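On the JS side, the same pixel buffer can be read without copying, through a typed-array view over the module's linear memory. A sketch with assumed names (`memory` and `canvasW` would come from the instantiated module; `loadRenderedJs` is mine):

```typescript
// Read one rendered pixel from the module's linear memory.
// Uint32Array indexing already scales by 4 bytes, matching
// the 4 * (...) byte address on the WASM side.
function loadRenderedJs(memory: WebAssembly.Memory, canvasW: number,
                        x: number, y: number): number {
  const pixels = new Uint32Array(memory.buffer);
  return pixels[y * canvasW + x];
}
```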
3. From number to arbitrary-precision arithmetic
Here's the image at 10^14 zoom when using only 64-bit floating-point numbers:
Looks like a UFO trace from an early-'90s game 🙂
To address this problem, we need to extend the width of a number, which is known as arbitrary-precision arithmetic. Unfortunately, there's no such thing in the WASM standard library, so I implemented my own version, heavily tailored to the problem.
Very generally, a number consists of multiple 32-bit “digits”:
             | integer | fractional part
             | part    |
-------------+---------+----------------------------
32-bit items | digit0  | digit1   digit2   digit3 ...
bits         | 0-31    | 32-63    64-95    96-127
In fact, that's similar to how we write fractional numbers, except the “digits” are not 10-based but 2^32-based. Or 2-based, if we think of every bit as a “digit”; the decimal values of such 2-based “digits” are: 2^31, 2^30, ..., 2, 1, (decimal separator), 1/2, 1/4, ...
Here's an example:
decimal = 10.5
hex = a.8
binary = 1010.1
And another:
hex = c.8000_0000_8 (width = 3)
decimal = 12
+ 1/2 (bit 32)
+ 1/2^33 (bit 64)
≈ 12.5000000001
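To make the layout concrete, here's a small helper (hypothetical, not from the repo) that evaluates such a digit array; plain `number` loses precision beyond ~53 bits, so this is for illustration only:

```typescript
// Digit k contributes digits[k] / 2^(32 * k): digit0 is the integer part,
// digit1 the first 32 fractional bits, and so on
function valueOf(digits: number[]): number {
  let value = 0;
  for (let k = 0; k < digits.length; k++) {
    value += digits[k] / 2 ** (32 * k);
  }
  return value;
}
```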
Arithmetic operations
Operations work the same way we were taught in primary school. For example, the addition:
a0 a1 a2 ...
+ b0 b1 b2 ...
============
c0 c1 c2 ...
Note: the overflow (the carry-out from c0) is discarded because it's not needed for the task.
How to write the code for this with arbitrary width? Well, there are two ways:
Operands in memory: use loops
Each number is stored in consecutive memory cells, and we may use indexed loops to access digits:
function add(a: u32, b: u32, c: u32, width: u32): void {
  let carryOver: u64 = 0;
  // a, b, c are byte addresses; each 32-bit digit occupies 4 bytes,
  // hence the byte offset 4 * i
  for (let i: i32 = width - 1; i >= 0; i--) {
    const x: u64 = (load<u32>(a + 4 * i) as u64)
      + (load<u32>(b + 4 * i) as u64)
      + carryOver;
    carryOver = x >>> 32;
    store<u32>(c + 4 * i, x as u32);
  }
}
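Ported to plain TypeScript (BigInt standing in for u64, a Uint32Array for linear memory; `addDigits` is my name), the same loop can be tested in isolation:

```typescript
// Add two big numbers digit by digit, most significant digit at index 0.
// The carry out of digit0 is discarded, as in the WASM version.
function addDigits(a: Uint32Array, b: Uint32Array, c: Uint32Array, width: number): void {
  let carryOver = 0n;
  for (let i = width - 1; i >= 0; i--) {
    const x = BigInt(a[i]) + BigInt(b[i]) + carryOver;
    carryOver = x >> 32n;           // carry into the next (more significant) digit
    c[i] = Number(x & 0xffffffffn); // keep the low 32 bits
  }
}
```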
Operands are local vars: unroll loops
const width = 3

let a0: u64, a1: u64, a2: u64,
    b0: u64, b1: u64, b2: u64,
    c0: u64, c1: u64, c2: u64,
    carryOver: u64;

// ...

carryOver = 0;
c2 = a2 + b2;
carryOver = c2 >>> 32;
c2 &= 0xffffffff;

c1 = a1 + b1 + carryOver;
carryOver = c1 >>> 32;
c1 &= 0xffffffff;

c0 = a0 + b0 + carryOver;
c0 &= 0xffffffff;
Note: although digits are 32-bit, variables are 64-bit to retain carry-over.
Which is faster?
The approach with unrolled loops is faster (15–20% in my case). However, it requires separate code for each width. Happily, when the code is almost the same, you can easily generate it. I generate computations for widths up to 12. I doubt anyone zooms deeper, but if you do, please file an issue 😉
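The generation step can be as simple as string templating. A sketch of the idea (not the repo's actual generator; BigInt stands in for u64, and digits live in plain arrays here):

```typescript
// Emit the unrolled addition body for a given width as JS source
function genAddBody(width: number): string {
  const lines: string[] = ['let carry = 0n;'];
  for (let i = width - 1; i >= 0; i--) {
    lines.push(`c[${i}] = a[${i}] + b[${i}] + carry;`);
    lines.push(`carry = c[${i}] >> 32n;`);
    lines.push(`c[${i}] &= 0xffffffffn;`); // masking digit0 discards the overflow
  }
  return lines.join('\n');
}

// Compile the generated body once, then call it like a regular function
const add2 = new Function('a', 'b', 'c', genAddBody(2)) as
  (a: bigint[], b: bigint[], c: bigint[]) => void;
```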
Now, the fixed image. Indeed a UFO, but much nicer 🙂
4. WASM vectorization
“Single instruction, multiple data” (SIMD) or “vectorization” is simply when your CPU executes the same instructions against multiple variables at once. In WASM you work with 128-bit vectors that contain 2 to 16 data items (“lanes”) depending on their size. For example, if your data items are 8-bit integers, one vector contains 16 lanes. In my case, data items are 64-bit (floating-point or integer), so I'm able to compute series for 2 pixels at once.
Here's an example. First, the non-vectorized code for 64-bit floats:
for (let i: u32 = 0; i < maxIterations; i++) {
  const xSqr: f64 = x * x;
  const ySqr: f64 = y * y;
  if (xSqr + ySqr > 4) {
    // Diverged
    break;
  }
  // Compute next (x, y) ...
}
Now the vectorized version:
for (let i: u32 = 0; i < maxIterations; i++) {
  const xSqr = v128.mul<f64>(x, x);
  const ySqr = v128.mul<f64>(y, y);
  // FOUR is an f64x2 vector with 4 in both lanes; gt produces an
  // all-ones mask in each lane where the comparison holds
  const g4 = v128.gt<f64>(v128.add<f64>(xSqr, ySqr), FOUR);
  if (!point0DivergedAt && v128.extract_lane<u64>(g4, 0)) {
    point0DivergedAt = i;
  }
  if (!point1DivergedAt && v128.extract_lane<u64>(g4, 1)) {
    point1DivergedAt = i;
  }
  if (point0DivergedAt && point1DivergedAt) {
    // Both diverged
    break;
  }
  // Compute next (x, y) ...
}
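For clarity, here's what the two-lane loop computes, emulated in scalar TypeScript (names are mine; 0 means "did not diverge within maxIterations"):

```typescript
// Iterate two points c = (cx[p], cy[p]) in lockstep and record the
// iteration at which each diverges (0 = stayed bounded)
function iterate2(cx: number[], cy: number[], maxIterations: number): [number, number] {
  const x = [0, 0], y = [0, 0];
  const divergedAt: [number, number] = [0, 0];
  for (let i = 1; i <= maxIterations; i++) {
    for (const p of [0, 1]) {
      if (divergedAt[p]) continue;  // this lane is already done
      const xSqr = x[p] * x[p], ySqr = y[p] * y[p];
      if (xSqr + ySqr > 4) { divergedAt[p] = i; continue; }
      const xNext = xSqr - ySqr + cx[p];
      y[p] = 2 * x[p] * y[p] + cy[p];
      x[p] = xNext;
    }
    if (divergedAt[0] && divergedAt[1]) break;  // both diverged
  }
  return divergedAt;
}
```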
How much faster did it get?
Floating-point arithmetic became exactly 2x faster. The big-number version got only 15–20% faster, and honestly, I don't know why. Perhaps my code has a problem that I cannot spot; I hope to find it someday.
5. Optimization: filling same-color rectangles
Note how many same-color areas there are in the following example:
Computing an almost identical series for each pixel seems inefficient. Is it possible to optimize this? Happily, yes. With this optimization, rendering is coded as a recursive process of painting nested rectangles. The first rectangle is the whole render task given to the worker. Then, the recursive algorithm goes like this:
- Render only the contour of the current rectangle plus its midpoint; or, for small rectangles, only the corners plus the midpoint
- If all rendered pixels are of the same color, fill the entire rectangle with that color
- Otherwise:
  - If the current rectangle is still not too small, split it into two and recursively render each half
  - Otherwise, render all pixels of the current rectangle as usual
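A sketch of this recursion in TypeScript (simplified: it always renders the full contour, omitting the corners-only case for small rectangles; `colorAt` is a stand-in for the per-pixel series computation, and all names are mine):

```typescript
type Color = number;

// Recursively render the rectangle [x0, x1) × [y0, y1) into `out`
// (row-major, canvas width `w`), filling uniform areas without
// computing their interior pixels
function renderRect(
  out: Color[], w: number,
  x0: number, y0: number, x1: number, y1: number,
  colorAt: (x: number, y: number) => Color,
  minSize = 3,
): void {
  if (x1 <= x0 || y1 <= y0) return;
  // 1. Render the contour and the midpoint, collecting the colors seen
  const seen = new Set<Color>();
  for (let x = x0; x < x1; x++)
    for (const y of [y0, y1 - 1]) {
      const c = colorAt(x, y);
      out[y * w + x] = c; seen.add(c);
    }
  for (let y = y0; y < y1; y++)
    for (const x of [x0, x1 - 1]) {
      const c = colorAt(x, y);
      out[y * w + x] = c; seen.add(c);
    }
  const mx = (x0 + x1) >> 1, my = (y0 + y1) >> 1;
  const mid = colorAt(mx, my);
  out[my * w + mx] = mid; seen.add(mid);

  if (seen.size === 1) {
    // 2. Uniform contour: fill the interior without computing it
    for (let y = y0 + 1; y < y1 - 1; y++)
      for (let x = x0 + 1; x < x1 - 1; x++)
        out[y * w + x] = mid;
  } else if (x1 - x0 > minSize || y1 - y0 > minSize) {
    // 3. Split along the longer side and recurse
    if (x1 - x0 >= y1 - y0) {
      renderRect(out, w, x0, y0, mx, y1, colorAt, minSize);
      renderRect(out, w, mx, y0, x1, y1, colorAt, minSize);
    } else {
      renderRect(out, w, x0, y0, x1, my, colorAt, minSize);
      renderRect(out, w, x0, my, x1, y1, colorAt, minSize);
    }
  } else {
    // 4. Too small to split: render every pixel as usual
    for (let y = y0; y < y1; y++)
      for (let x = x0; x < x1; x++)
        out[y * w + x] = colorAt(x, y);
  }
}
```

Note that this is a heuristic: a rectangle whose contour is uniform may still contain a differently colored island, which is why real renderers (this one included) keep the rectangles reasonably small.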
Here's the same image with fast-filled rectangles outlined:
This enormously boosts performance, especially for “stable” parts of the image, i.e., parts that do not reveal too much color variability.
Conclusion: is it finally fast enough?
The issue with the Mandelbrot set is that, as the iteration count and the big-number width keep growing with zoom, lags cannot be eliminated, only postponed. Optimizations let you explore the set more deeply, but inevitably there comes a point where the required amount of computation exceeds any CPU's power. This means there will always be room for optimization, but at present, I seem to have exhausted all available options. Or am I wrong? Can you suggest anything to improve? And thank you for reading!