Hi and welcome to the series that will explain various aspects of UTF encoding.
Let's start with a common misconception: Unicode and UTF. Many people use those terms interchangeably and say things like "This text has Unicode encoding". However, these are not synonyms.
Unicode is a consortium: a non-profit corporation devoted to developing, maintaining, and promoting software internationalization standards and data.
They created and maintain the Unicode Standard, which catalogues all characters used worldwide. The current version, 15.0, contains 149,186 characters.
UTF stands for Unicode Transformation Format and is the technical implementation of the Unicode Standard: it defines how to represent all those catalogued characters as bytes. It comes in UTF-8, UTF-16 and UTF-32 variants (which will be explained later). Less common encodings like BOCU and SCSU implement the same standard too, but they are binary incompatible with UTF.
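To make the difference concrete, here is a minimal Python sketch. It takes one character from the Unicode catalogue and shows the bytes that each UTF variant produces for it. The euro sign is just an arbitrary example, and the big-endian codec names are picked here only so that Python does not prepend a byte order mark to the output.

```python
# One catalogued character, three different byte representations.
ch = "€"  # U+20AC EURO SIGN, chosen only as an example

# The code point is the character's number in the Unicode catalogue.
code_point = ord(ch)

# The "-be" (big-endian) codecs are an assumption made for readability:
# they keep Python from adding a byte order mark in front of the data.
utf8 = ch.encode("utf-8")
utf16 = ch.encode("utf-16-be")
utf32 = ch.encode("utf-32-be")

print(f"U+{code_point:04X}")  # U+20AC
print(utf8.hex(" "))          # e2 82 ac
print(utf16.hex(" "))         # 20 ac
print(utf32.hex(" "))         # 00 00 20 ac
```

The code point stays the same in every case; only the bytes change. That is the whole distinction: Unicode assigns the number, UTF decides how the number is written down.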
So if you refer to a specific byte representation of a text (like a document on disk or a variable in memory), you should say precisely "This text has UTF-8 encoding".
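A short Python sketch of why the precise name matters. The sample text is made up, and Latin-1 is used only as an example of "some other encoding"; the point is that the same bytes give different text depending on which encoding you claim they are in.

```python
# Text in memory is a sequence of Unicode code points.
text = "naïve café"  # hypothetical sample text

# Bytes on disk or on the wire always have one specific encoding.
data = text.encode("utf-8")

# Naming the right encoding restores the original text.
roundtrip = data.decode("utf-8")

# Naming a different one (Latin-1 here, purely as an example)
# does not fail, it silently produces different text.
misread = data.decode("latin-1")

print(len(text), len(data))  # 10 12
print(roundtrip)             # naïve café
print(misread)               # naÃ¯ve cafÃ©
```

Saying "this text is Unicode" would not tell the reader of those 12 bytes how to turn them back into 10 characters. Saying "this text is UTF-8" does.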
Coming up next: Madness before UTF - a short history lesson about dark times.