Madness before UTF

#unicode #utf

A long time ago, computers mostly used the ASCII encoding. ASCII stands for American Standard Code for Information Interchange and was a very early attempt (its first revision was published in 1963) to unify the binary representation of text.

Originally it used 7 bits, which allowed it to encode 2^7 = 128 characters. Uppercase Latin letters, Arabic numerals and punctuation were included from the start, and lowercase letters followed in the 1967 revision. Because it was first used for teleprinters and teletype machines, it also contained a lot of control characters, which have no graphical representation but trigger some effect, like moving to a new line or acknowledging a transmission. Here is the full list:

But what if someone needed to write another character? This is where creativity kicked in. ASCII used 7 bits, but most computers settled on the 8-bit byte as their basic unit of storage. This meant that the smallest chunk of data exchanged between CPU, RAM and disk had room for more values (2^8 = 256 combinations) than a single ASCII character needed. This spare space was used to store characters not included in base ASCII.
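The arithmetic above is easy to check. Here is a minimal Python sketch; the helper name `fits_in_ascii` is made up for this example, and it relies on the fact that Python's `ord` returns a Unicode code point, which matches ASCII for the first 128 values.

```python
ASCII_BITS = 7
BYTE_BITS = 8

ascii_slots = 2 ** ASCII_BITS            # values 0..127 are taken by ASCII
byte_slots = 2 ** BYTE_BITS              # values 0..255 fit in one byte
spare_slots = byte_slots - ascii_slots   # room left over for "extensions"


def fits_in_ascii(char: str) -> bool:
    """Return True if the character has a code inside the 7-bit ASCII range."""
    return ord(char) < ascii_slots


print(ascii_slots, byte_slots, spare_slots)
print(fits_in_ascii("A"))   # a plain Latin letter
print(fits_in_ascii("ę"))   # a Polish accented letter
```

Only 128 values are left over, and that is all the room every extended encoding had to work with.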

However, that spare space in a single byte could hold only 128 additional characters. That is far too few to cover accented Latin letters like ę, Cyrillic like Д, Kanji like 鰯, Greek, Cherokee, math symbols, dead alphabets like Runic, etc. Various encodings were created as ASCII extensions, each filling those extra 128 slots with only the characters needed for a specific use case. What seemed like a brilliant idea was actually a poison that tormented the computer industry for decades with:

Issue 1 - Operating system incompatibility

Some encodings were standardized by ISO, but Microsoft and Apple went their own way. So the same character ñ had two different binary values: 0xF1 in ISO-8859-1 and Microsoft CP-1252, but 0x96 in Mac OS Roman. If you wanted to write Ź, it was 0xAC in ISO-8859-2 but 0x8F in Microsoft CP-1250, and you were out of luck in Mac OS Roman, which did not support it at all. It was truly a great time to receive a document from a friend using a different machine. Here are two encodings that can represent Polish characters, compared side by side. Above the green line is base ASCII. Below the green line is chaos, with differences marked in red.
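If you want to verify the byte values quoted above yourself, Python still ships codecs for these legacy encodings. This is just a small sketch: the helper `byte_for` is a name invented for this example, while `iso8859_1`, `iso8859_2`, `cp1250`, `cp1252` and `mac_roman` are codec names from Python's standard library.

```python
from typing import Optional


def byte_for(char: str, encoding: str) -> Optional[str]:
    """Return the single-byte value of char in the given encoding, or None if unsupported."""
    try:
        encoded = char.encode(encoding)
    except UnicodeEncodeError:
        return None  # the encoding has no slot for this character
    return f"0x{encoded[0]:02X}"


checks = [
    ("ñ", ["iso8859_1", "cp1252", "mac_roman"]),
    ("Ź", ["iso8859_2", "cp1250", "mac_roman"]),
]
for char, encodings in checks:
    for encoding in encodings:
        print(char, encoding, byte_for(char, encoding))

# The problem works in the other direction too: one byte, two meanings.
same_byte = b"\xf1"
print(same_byte.decode("iso8859_1"), same_byte.decode("iso8859_2"))
```

The last line shows why opening a file with the wrong encoding produced garbage: byte 0xF1 is ñ in ISO-8859-1 but ń in ISO-8859-2.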

Issue 2 - Encoding switching

What if you wanted to write a Spanish name in your Russian text? If the characters you needed lived in two different encodings, the text had to contain hidden instructions to switch encodings on the fly. Every office suite, every editor and every email client used its own method back then. That led to common issues with copy-pasting, because those hidden instructions were not understood by other programs. Copy-paste, something we take for granted today, was a painful experience in the past.

How bad was it? Well, I took the liberty of checking in how many ways you could write the letters of the Polish alphabet. I found a whopping 26 encodings:

Hint: If you deal with retro tech, iconv is your friend. It can convert between 140 encodings.

iconv -f CP1250 -t ISO-8859-2 windows_file.txt > iso_file.txt
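If iconv is not at hand, the same conversion can be sketched in Python. This is a rough equivalent rather than a replacement for iconv: `convert_encoding` is a hypothetical helper written for this example, and the sample word is just test data containing Polish letters.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Decode bytes from the source encoding and re-encode them in the target one."""
    # Decoding goes through Unicode, so it fails loudly if a character
    # cannot be represented in the target encoding.
    return data.decode(source).encode(target)


windows_bytes = "Źdźbło".encode("cp1250")   # as a Windows machine would store it
iso_bytes = convert_encoding(windows_bytes, "cp1250", "iso8859_2")

print(windows_bytes.hex(" "))
print(iso_bytes.hex(" "))
```

Comparing the two hex dumps shows that only the bytes for Ź and ź change, while the ASCII letters and ł stay the same.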

On a white horse

Unification was an urgent need, and when the Unicode Consortium announced "hey, we propose a common encoding for ALL characters to end this madness", it took the computer world by storm. Here is an interesting graphic from Wikipedia showing the popularity of encodings in the years 2010-2021 and the total UTF-8 domination we know today:

Coming up next: Variable encoding length to the rescue!
