To achieve a maximum decrease in the amount of data, which is being transferred with each web request and response, HTTP/2 protocol uses HPACK format for encoding headers and Hoffman algorithm for its literal values.
In our previous article HAPCK: Huffman encoder I explained what Huffman coding is and how exactly the algorithm is used to encode the content. I continue explaining this by looking at this process from a reversed point of view. Meaning; how can we convert the encoded data back to its original form, again optimally and sustainably performance-wise. For clarity, and since there is a lot of content to cover, I’ve decided to split this into two parts, also because encoding itself entails two separate procedures.
If you search the web you won’t find a lot of information about the Huffman decoding process, and there are even less concrete examples describing the actual process. Such information can be found in scientific articles, which are hardly readable for the general public. So I have been researching this question extensively. Luckily my friend William Entriken guided me in the search for the optimal solutions by sharing how he used this kind of approach while playing chess. The trick was in The Huffman data tree being flattened to a 2-dimensional table of transitions.
When the web server receives a header, for which it determines that it contains content encoded with the canonical Huffman algorithm, it has to decode this content in the shortest possible time with as few resources as possible. The execution speed of this “simple“ task will contribute significantly to the server’s response time, and this time must be as short as possible.
Encoded Huffman data represents a bit-sequence of zeros and ones. This is where we are faced with the question number. 1: Which and how many ones and zeros represent what character?
Reading and decoding bit by bit appears to be inadequate performance-wise. I guess we all know how to read the data with reader or stream objects, thus we are aware that reading in buffered chunks outperforms reading bit by bit. So the first trick of fast Huffman decoding is reading N-bits at a time. However, this information alone does not help us much, since we cannot determine how the seemingly “random” Huffman sequence corresponds to actual data. The solution is not just in the flattening of the Huffman tree into a two-dimensional table, but to create such a matrix that enables decoding N-bits at a time. Note that we can create such a matrix for an arbitrary number of bits that the decoder will read at a time.
HPACK documentation provides an already prepared and for the web optimized Huffman code for all ASCII characters. To implement the Huffman algorithm for HPACK we’ll have to first flatten this table to a two-dimensional matrix as mentioned above. This would allow for reversing the encoded Huffman sequence back into the readable characters.
First, let’s look at such flattening on a very simple example. Our algorithm will enable the conversion of letters A, B, C, D, and E into a Huffman sequence. The Huffman code for each letter is shown in the table below.
Character | Huffman code |
---|---|
A | 00 |
B | 01 |
C | 100 |
D | 101 |
E | 11 |
We have decided to flatten the Huffman table into a matrix, enabling the decoder to read Huffman bit-sequence 2-bits at a time. The illustration below shows the table structure we need to fill in.
PATH | ID | SYM | LFT | 00 | 01 | 10 | 11 |
---|---|---|---|---|---|---|---|
// | 0 | - | - | - | - | - | - |
The first column PATH will serve for our notes in which we’ll store read bits so we will know what sequence refers to what table row. Reading of each character’s code always starts in the root row marked with //
. The column ID
will store the unique name of the row. The first row is marked with 0
. The column SYM
will store characters (e.g. A). Field LFT
will store the information about the leftover bits. A leftover bit is a number of bits, missing to reach the full bit chunk (in our case 2 bits). For example, letter C and D have a leftover of 1, because to reach a round number of bits, which is in our case 2 bits * N, 1 bit remains. Letters A, B, and E have no leftover. The remaining columns represent the read chunk of 2 bits for all its possible values ranging from 00
(0) to 11
(3).
The table above will now be filled with data of sample Huffman coding. As mentioned previously, we are reading the Hoffman code 2-bits at a time. Let’s see how to insert data to the table for the first letter A.
Letter A is represented with code 00
. Since there is no path //00
for this code in the first column, we create a new line with a new ID
. There is no leftover, and in the root line to column 00
we write the ID
of the newly established line. Since we read all the bits for the letter A, we also write character A in the SYM
column.
Path | ID | SYM | LFT | 00 | 01 | 10 | 11 |
---|---|---|---|---|---|---|---|
// | 0 | - | - | 1 | - | - | - |
//00 | 1 | A | 0 | - | - | - | - |
We then repeat this process for the letter B. The letter B is represented with code 01
. Since there is no path //01
for this code, we create a new line with a new ID
. There is no leftover, and in the root line in column 01
we write the ID
of the newly established line. Since we read all the bits for the letter B, we also write character B to the SYM
column.
Path | ID | SYM | LFT | 00 | 01 | 10 | 11 |
---|---|---|---|---|---|---|---|
// | 0 | - | - | 1 | 2 | - | - |
//00 | 1 | A | 0 | - | - | - | - |
//01 | 2 | B | 0 | - | - | - | - |
The process for the letter C is somewhat different since its number of bits doesn’t correspond to 2-bits * N. The final bit is therefore missing, so we claim that it has a leftover of 1. First, we read the first 2 bits and insert them in the table following the same process as before. After that, we read the remaining bit, while assuming that all the possible variations of the missing bit exist. This is marked with X
. Since one bit is missing, we note this in the column LFT
.
Path | ID | SYM | LFT | 00 | 01 | 10 | 11 |
---|---|---|---|---|---|---|---|
// | 0 | - | - | 1 | 2 | 3 | - |
//00 | 1 | A | 0 | - | - | - | - |
//01 | 2 | B | 0 | - | - | - | - |
//10 | 2 | B | 0 | - | - | - | - |
//10 | 3 | - | - | 4 | 4 | - | - |
//100X | 4 | C | 1 | - | - | - | - |
We repeat the process for letters D and E. The final table should look like this:
Path | ID | SYM | LFT | 00 | 01 | 10 | 11 |
---|---|---|---|---|---|---|---|
// | 0 | - | - | 1 | 2 | 3 | 6 |
//00 | 1 | A | 0 | - | - | - | - |
//01 | 2 | B | 0 | - | - | - | - |
//10 | 2 | B | 0 | - | - | - | - |
//10 | 3 | - | - | 4 | 4 | 5 | 5 |
//100X | 4 | C | 1 | - | - | - | - |
//101X | 5 | D | 1 | - | - | - | - |
//11 | 6 | E | 0 | - | - | - | - |
Note that it would be correct to replace the variants marked with X with actual possible paths.
Path | ID | SYM | LFT | 00 | 01 | 10 | 11 |
---|---|---|---|---|---|---|---|
// | 0 | - | - | 1 | 2 | 3 | 6 |
//00 | 1 | A | 0 | - | - | - | - |
//01 | 2 | B | 0 | - | - | - | - |
//10 | 2 | B | 0 | - | - | - | - |
//10 | 3 | - | - | 4 | 4 | 5 | 5 |
//1000 | 4 | C | 1 | - | - | - | - |
//1001 | 4 | C | 1 | - | - | - | - |
//1010 | 5 | D | 1 | - | - | - | - |
//1011 | 5 | D | 1 | - | - | - | - |
//11 | 6 | E | 0 | - | - | - | - |
The flattened form of the Huffman tree in the form of a matrix plays a crucial role in the process of decoding. I wrote a complete implementation of such a flattener for generating translation matrixes with the support for N-bits. It's written in Rust and is available open-source on GitHub.
We now have an idea of what the process of decoding looks like, using this matrix. I’ll be talking about this in the next article, where we’ll look at the decoding process in detail.
The next article HPACK: Huffman decoder continues here and describes the full decoding process.
Top comments (0)