In the previous post of this series I explained how extending base 7-bit ASCII led to encoding chaos of biblical proportions. The issue was the method used - everyone was trying to jam non-ASCII characters into the free space of a single byte, causing conflicts and compatibility issues.
The UTF people took a different approach - store all characters* in a single encoding and dynamically extend the available space by using multiple bytes for some less common characters.
[*] They are actually codepoints, but that will be explained later - let's call them characters for now.
Note: They were not the first to have this idea. Multi-byte CJK encodings for Chinese and Shift-JIS for Japanese were already hacking around the single-byte limitation out of necessity, because even a whole single byte could not fit those alphabets. If you like clever algorithms, read about Shift-JIS - it is mind blowing. Also, UCS can be considered the real technical predecessor of UTF.
Back to UTF - it comes in 3 variants:
- UTF-8 stores a character using 1, 2, 3 or 4 bytes.
- UTF-16 stores a character using 2 or 4 bytes (it is a common misconception that "16" means 16 bits / 2 bytes only).
- UTF-32 stores a character using 4 bytes.
So the theoretical maximum capacity is 2^32 = 4_294_967_296 characters. The real capacity is way lower, because many bit patterns are reserved for organizational purposes - for UTF-8 it is 1_112_064. Still, that can be considered "unlimited" when compared to the 128 characters of 7-bit ASCII or the 256 characters of the various encodings mentioned in the previous post.
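If you want to see those 1/2/3/4-byte sizes for yourself, here is a minimal sketch - Python, purely as an illustration, since the post itself is language-agnostic:

```python
# Byte counts of the same characters in the three UTF variants.
# The "-le" codecs skip the BOM, so the lengths reflect the raw character data only.
for ch in ["a", "é", "€", "🙂"]:
    print(
        ch,
        len(ch.encode("utf-8")),      # 1, 2, 3 and 4 bytes respectively
        len(ch.encode("utf-16-le")),  # 2 or 4 bytes
        len(ch.encode("utf-32-le")),  # always 4 bytes
    )
```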
UTF-16 and UTF-32
Before I focus on UTF-8 I'd like to briefly talk about those two. UTF-16 was an abomination and never got popular. The main issues were the weird character organization, the lack of backward compatibility with 7-bit ASCII, and the use of 0x00 (null) bytes. Null bytes terminate strings in the C language, so reading this encoding required extra care with memory handling. Here is a rare picture of an unaware C programmer who just read his first UTF-16 string ;)
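To make the null-byte problem concrete, here is a small sketch (Python again, just for illustration) of why a C-style strlen() chokes on UTF-16 data:

```python
# "ABC" encoded as UTF-16 (little-endian, no BOM) interleaves 0x00 bytes,
# and a C-style strlen() stops at the very first one.
data = "ABC".encode("utf-16-le")
print(data)                    # b'A\x00B\x00C\x00'
print(data.index(0))           # 1 - a null byte right after the first character
print(data.split(b"\x00")[0])  # b'A' - all that a naive null-terminated reader sees
```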
UTF-32 has the same flaws and on top of that it seems like a huge waste of space - you need 4 times more bytes to store a simple a than in ASCII. However, sometimes having a predictable, fixed characters-to-bytes ratio is so beneficial that it outweighs the additional space cost. For example, if you need to access the 128th character in a string, it starts at the 127*4+1th byte in memory (assuming the composed form, which will be explained later).
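A minimal sketch of that fixed-width arithmetic (Python once more; the string and index are made up for illustration):

```python
# With a fixed 4 bytes per character, the n-th character is found by arithmetic alone,
# without scanning any of the preceding bytes.
text = "a" * 200                 # any string at least 128 characters long
raw = text.encode("utf-32-le")
n = 128                          # the 128th character, counting from 1 as in the post
offset = (n - 1) * 4             # its first byte - the (127*4+1)-th byte when counting from 1
chunk = raw[offset : offset + 4]
print(chunk.decode("utf-32-le")) # 'a'
```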
Coming up next: Genius design of UTF-8.