We already know from previous posts in this series that UTF-8 is a variable-length, multi-byte encoding. But how exactly does this work? How does a program know where each character starts and how many bytes it has?
0xxxxxxx - This is a 1-byte character. You may notice that it uses the same bits as 7-bit ASCII, and that is correct - UTF-8 is compatible with ASCII. However, this leading 0 is important, because it was repurposed to serve as a byte length terminator.
110xxxxx 10xxxxxx - This is a 2-byte character.
1110xxxx 10xxxxxx 10xxxxxx - This is a 3-byte character.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - This is a 4-byte character.
So:
- The number of 1s before the first 0 tells how many bytes a multi-byte character has.
- The following bytes must start with 10, which marks them as multi-byte character continuations.
- If there is no leading 1, it is ASCII.
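These rules boil down to a few bit tests on the first byte. Here is a Python sketch of mine (the post itself uses Raku) with a hypothetical helper name:

```python
def utf8_length(first_byte: int) -> int:
    """Total byte length of a UTF-8 sequence, derived from its first byte only."""
    if first_byte < 0b10000000:       # 0xxxxxxx -> ASCII, 1 byte
        return 1
    if first_byte >= 0b11111000:      # 11111xxx never starts a valid sequence
        raise ValueError("invalid start byte")
    if first_byte >= 0b11110000:      # 11110xxx -> 4 bytes
        return 4
    if first_byte >= 0b11100000:      # 1110xxxx -> 3 bytes
        return 3
    if first_byte >= 0b11000000:      # 110xxxxx -> 2 bytes
        return 2
    raise ValueError("10xxxxxx is a continuation byte, not a start byte")

print(utf8_length("a".encode("utf-8")[0]))   # 1
print(utf8_length("ź".encode("utf-8")[0]))   # 2
print(utf8_length("😊".encode("utf-8")[0]))  # 4
```

Note that the character's length is fully determined by its first byte; no lookup tables are needed.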
Here are some real characters analyzed:
$ raku -e '"a".encode>>.fmt( "%08b" ).say'
(01100001)
$ raku -e '"ź".encode>>.fmt( "%08b" ).say'
(11000101 10111010)
$ raku -e '"😊".encode>>.fmt( "%08b" ).say'
(11110000 10011111 10011000 10001010)
Note on Raku: I will illustrate many UTF examples using the Raku language. It has excellent built-in UTF support and compact syntax with no boilerplate. I will also explain the syntax briefly, which may be outside the scope of this series but will help you understand what is going on in these one-liners.
In this case the character is encoded into a byte buffer. Each byte is passed to a formatting function (>> is just a lazy way to avoid for or map) which prints it as eight zero-padded bits.
Let's stop for a moment to admire the genius of the UTF-8 design:
- It is 7-bit ASCII compatible, which also means it is space efficient: the most commonly used letters take a single byte.
- It has natural protection against the ASCII extensions described previously, which used the 1xxxxxxx space. If the byte representation of a text contains a byte starting with 1 that does not match the pattern "n leading 1s followed by a 0 must be followed by n-1 bytes starting with 10", then the parser can detect that some crooked encoding is being loaded as UTF-8.
$ raku -e 'Buf.new( 0b10000000 ).decode'
Malformed UTF-8 at line 1 col 1
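Other languages reject this the same way. For instance, Python's strict decoder (my example, not from the post) raises on a lone continuation byte:

```python
# 0b10000000 is a continuation byte with no start byte before it - invalid UTF-8.
try:
    bytes([0b10000000]).decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)  # the decoder names the problem, e.g. an invalid start byte
```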
- Software does not have to know the list of Unicode characters to find their boundaries. It is very easy to add basic UTF-8 support and the multi-byte character concept to very old programs by doing simple byte math.
- Programs do not have to support the latest Unicode version. They can find the start/end of an unknown character and display some replacement glyph without messing up the rest of the text.
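That "simple byte math" can be sketched as: a byte begins a character unless its top two bits are 10. A Python sketch of mine, with a hypothetical helper name:

```python
def char_starts(data: bytes) -> list[int]:
    """Indexes where characters begin: every byte NOT of the form 10xxxxxx."""
    return [i for i, b in enumerate(data) if b & 0b11000000 != 0b10000000]

# "a" is 1 byte, "ź" is 2 bytes, "😊" is 4 bytes:
print(char_starts("aź😊".encode("utf-8")))  # [0, 1, 3]
```

A program only needs this single bit test to skip over an unknown character and keep the rest of the text intact.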
Coming up next: Fun with printing sound (optional). Codepoints: what does U+0105 mean?