In the previous post of this series I explained how extending base 7-bit ASCII led to encoding chaos of biblical proportions. The issue was the method used - everyone was trying to jam non-ASCII characters into the free space of a single byte, causing conflicts and compatibility issues.
The UTF people took a different approach - store all characters* in a single encoding and dynamically extend the available space by using multiple bytes for some less common characters.
[*] They are actually codepoints, but that will be explained later - let's call them characters for now.
Note: They were not the first to have this idea. Multi-byte CJK encodings for Chinese and Shift-JIS for Japanese were already hacking around the single-byte limitation out of necessity, because even a whole single byte could not fit those alphabets. If you like clever algorithms, read about Shift-JIS - it is mind blowing. Also, UCS can be considered the real technical predecessor of UTF.
Back to UTF - it comes in 3 variants:
- UTF-8 stores a character using 1, 2, 3 or 4 bytes.
- UTF-16 stores a character using 2 or 4 bytes (it is a common misconception that "16" means 16 bits / 2 bytes only).
- UTF-32 stores a character using 4 bytes.
So the theoretical maximum capacity is 2^32 = 4_294_967_296 characters. The real capacity is way lower, because many bit patterns are reserved for organizational purposes - for UTF-8 it is 1_112_064. Still, that can be considered "unlimited" when compared to the 128 characters of 7-bit ASCII or the 256 characters of the various encodings mentioned in the previous post.
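If you want to see those 1/2/3/4-byte sizes for yourself, here is a minimal sketch - Python, purely as an illustration, since the post itself is language-agnostic:

```python
# Byte counts of the same characters in the three UTF variants.
# The "-le" codecs skip the BOM, so the lengths reflect the raw character data only.
for ch in ["a", "é", "€", "🙂"]:
    print(
        ch,
        len(ch.encode("utf-8")),      # 1, 2, 3 and 4 bytes respectively
        len(ch.encode("utf-16-le")),  # 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )
```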
UTF-16 and UTF-32
Before I focus on UTF-8 I'd like to briefly talk about those two. UTF-16 was an abomination and never got popular. The main issues were the weird character organization, the lack of backward compatibility with 7-bit ASCII, and the use of 0x00 (null) bytes. Null bytes terminate strings in the C language, so reading this encoding required extra care with memory handling. Here is a rare picture of an unaware C programmer who just read his first UTF-16 string ;)
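To make the null-byte problem concrete, here is a small sketch (Python again, just for illustration) of why a C-style strlen() chokes on UTF-16 data:

```python
# "ABC" encoded as UTF-16 (little-endian, no BOM) interleaves 0x00 bytes,
# and a C-style strlen() stops at the very first one.
data = "ABC".encode("utf-16-le")
print(data)                    # b'A\x00B\x00C\x00'
print(data.index(0))           # 1 - a null byte right after the first character
print(data.split(b"\x00")[0])  # b'A' - all that a naive null-terminated reader sees
```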
UTF-32 has the same flaws and on top of that it seems like a huge waste of space - you need 4 times more bytes to store a simple a than in ASCII. However, sometimes having a predictable, fixed characters-to-bytes ratio is so beneficial that it outweighs the additional space cost. For example, if you need to access the 128th character in a string, it starts at the 127*4+1th byte in memory (assuming the composed form, which will be explained later).
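A minimal sketch of that fixed-width arithmetic (Python once more; the string and index are made up for illustration):

```python
# With a fixed 4 bytes per character, the n-th character is found by arithmetic alone,
# without scanning any of the preceding bytes.
text = "a" * 200                 # any string at least 128 characters long
raw = text.encode("utf-32-le")
n = 128                          # the 128th character, counting from 1 as in the post
offset = (n - 1) * 4             # its first byte - the (127*4+1)-th byte when counting from 1
chunk = raw[offset : offset + 4]
print(chunk.decode("utf-32-le")) # 'a'
```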
Coming up next: Genius design of UTF-8.