We already know from previous posts in this series that UTF-8 is a variable-length, multi-byte encoding. But how exactly does this work? How does a program know where each character starts and how many bytes it has?
0xxxxxxx - This is a 1-byte character. You may notice that it uses the same bits as 7-bit ASCII, and that is correct - UTF-8 is compatible with ASCII. However, this leading 0 is important, because it was repurposed to serve as a byte length terminator.
110xxxxx 10xxxxxx - This is a 2-byte character.
1110xxxx 10xxxxxx 10xxxxxx - This is a 3-byte character.
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx - This is a 4-byte character.
So:
- The number of 1s before the first 0 tells how many bytes a multi-byte character has.
- The following bytes must start with 10, which marks them as multi-byte character continuations.
- If there is no leading 1, it is ASCII.
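These rules boil down to a few bit tests on the first byte. Here is a Python sketch of mine (the post itself uses Raku) with a hypothetical helper name:

```python
def utf8_length(first_byte: int) -> int:
    """Total byte length of a UTF-8 sequence, derived from its first byte only."""
    if first_byte < 0b10000000:       # 0xxxxxxx -> ASCII, 1 byte
        return 1
    if first_byte >= 0b11111000:      # 11111xxx never starts a valid sequence
        raise ValueError("invalid start byte")
    if first_byte >= 0b11110000:      # 11110xxx -> 4 bytes
        return 4
    if first_byte >= 0b11100000:      # 1110xxxx -> 3 bytes
        return 3
    if first_byte >= 0b11000000:      # 110xxxxx -> 2 bytes
        return 2
    raise ValueError("10xxxxxx is a continuation byte, not a start byte")

print(utf8_length("a".encode("utf-8")[0]))   # 1
print(utf8_length("ź".encode("utf-8")[0]))   # 2
print(utf8_length("😊".encode("utf-8")[0]))  # 4
```

Note that the character's length is fully determined by its first byte; no lookup tables are needed.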
Here are some real characters analyzed:
$ raku -e '"a".encode>>.fmt( "%08b" ).say'
(01100001)
$ raku -e '"ź".encode>>.fmt( "%08b" ).say'
(11000101 10111010)
$ raku -e '"😊".encode>>.fmt( "%08b" ).say'
(11110000 10011111 10011000 10001010)
Note on Raku: I will illustrate many UTF examples using the Raku language. It has excellent built-in UTF support and compact syntax with no boilerplate. I will also explain the syntax briefly, which may be outside the scope of this series but will help you understand what is going on in these one-liners.
In this case the character is encoded into a byte buffer. Each byte is passed to a formatting function (>> is just a lazy way to avoid for or map) which prints it as eight zero-padded bits.
Let's stop for a moment to admire the genius of the UTF-8 design:
- It is 7-bit ASCII compatible, which also means it is space efficient: the most commonly used letters take a single byte.
- It has natural protection against the ASCII extensions described previously, which used the 1xxxxxxx space. If the byte representation of a text contains a byte starting with 1 that does not match the pattern "n leading 1s followed by a 0 must be followed by n-1 bytes starting with 10", then the parser can detect that some crooked encoding is being loaded as UTF-8.
$ raku -e 'Buf.new( 0b10000000 ).decode'
Malformed UTF-8 at line 1 col 1
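Other languages reject this the same way. For instance, Python's strict decoder (my example, not from the post) raises on a lone continuation byte:

```python
# 0b10000000 is a continuation byte with no start byte before it - invalid UTF-8.
try:
    bytes([0b10000000]).decode("utf-8")
except UnicodeDecodeError as e:
    print("rejected:", e.reason)  # the decoder names the problem, e.g. an invalid start byte
```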
- Software does not have to know the list of Unicode characters to find their boundaries. It is very easy to add basic UTF-8 support and the multi-byte character concept to very old programs by doing simple byte math.
- Programs do not have to support the latest Unicode version. They can find the start/end of an unknown character and display some replacement glyph without messing up the rest of the text.
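That "simple byte math" can be sketched as: a byte begins a character unless its top two bits are 10. A Python sketch of mine, with a hypothetical helper name:

```python
def char_starts(data: bytes) -> list[int]:
    """Indexes where characters begin: every byte NOT of the form 10xxxxxx."""
    return [i for i, b in enumerate(data) if b & 0b11000000 != 0b10000000]

# "a" is 1 byte, "ź" is 2 bytes, "😊" is 4 bytes:
print(char_starts("aź😊".encode("utf-8")))  # [0, 1, 3]
```

A program only needs this single bit test to skip over an unknown character and keep the rest of the text intact.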
Coming up next: Fun with printing sound (optional). Codepoints: what does U+0105 mean?