After 17 years of working with computers, I'm still struck that some of the most fundamental building blocks we reach for every day are not universal.
General-purpose programming languages came into existence in the late 1950s and 1960s (the era of IBM's System/360), with Fortran, COBOL and early Lisps. The use of "String"
in computer science goes back nearly as far, to the long-since obsolete and esoteric language COMIT.
I mention this just to highlight that we've had long enough to do better at this, as an industry.
First, let's digress a little into serialization, just to separate the conceptual part (hello, category theory 👋) from the reality that much of what software engineers do day to day is serialization and deserialization of data over the wire, be that from browser to server over the web, or from server to a data store such as a database.
Serialization and deserialization are such a massive part of what we do that it's no wonder the "lowest common denominator" is the state of the art we are left with. If we had a better way to share "types" across systems, we'd effectively need everything from browsers, to toasters (thanks, IoT folks), to servers, to middleware, to support the new standard. Everyone can "fall back" to a string, so that's unfortunately what we get left with.
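To make that fallback concrete, here is a small sketch (in TypeScript, though any language tells the same story) of how JSON, the de facto wire format, flattens anything richer than a number or boolean into a string:

```typescript
// JSON knows about strings, numbers, booleans, arrays, objects and null, nothing else.
// Anything richer (dates, UUIDs, money, phone numbers) rides over the wire as a string.
const wire = JSON.stringify({ name: "Ada", bornAt: new Date("1815-12-10") });
console.log(wire); // {"name":"Ada","bornAt":"1815-12-10T00:00:00.000Z"}

const parsed = JSON.parse(wire);
console.log(parsed.bornAt instanceof Date); // false
console.log(typeof parsed.bornAt);          // "string", the type information is gone
```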
Eric S. Raymond once lamented that Unix's greatest triumph and biggest mistake were one and the same: "everything is a bag of bytes". The metaphor underpinning most of Linux and Unix's operating model for most of the last 35 years has been "everything is a file", and a file is a sequence of bytes.
"String" utility functions are broadly available, and it's a common paradigm.
Serialization aside for a moment, looking at programming languages themselves: by and large, the String type (or something akin to it, even in duck-typed languages such as Ruby) is ubiquitous.
Strings, character arrays, and suchlike are great, except for all the times they aren't.
A typical string-handling library will expose functions to uppercase, lowercase, trim, reverse, transpose, camel-case, title-case, and more, and more, and more. If we're extremely fortunate, we might have a library at hand which can reliably normalize Unicode diacritical characters, where a ¨ and a u can be combined in two ways to make a ü (either the precomposed character, or a u followed by a combining diaeresis), which, if we're folding, could be expanded to "ue" in German, most of the time, except in place names, but not so in Swedish or Danish.
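As a sketch of what "reliably normalize" means in practice, here is how it looks in TypeScript: the two encodings of ü only compare as equal after normalization, and the German "ue" expansion is a locale policy that no standard library call will make for you.

```typescript
// Two equally valid encodings of "ü": the precomposed character U+00FC,
// and a "u" followed by the combining diaeresis U+0308.
const precomposed = "\u00FC";
const decomposed = "u\u0308";

console.log(precomposed === decomposed);                                   // false
console.log(precomposed.normalize("NFC") === decomposed.normalize("NFC")); // true

// Folding "ü" to "ue" is a German-specific convention, not a Unicode operation;
// there is no built-in that knows about place names, Swedish, or Danish.
```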
Strings, in most languages, we are thankful to say, are encoded in UTF-8, a variable-width encoding which can represent up to 2,097,152 "code points" (the Unicode Consortium has so far defined a code space of about 1,112,064 valid values). The variable-width encoding means that an "a" in UTF-8 is the same byte as an "a" in ISO 8859-2, helping us leave the rocky late 1990s behind, when it was the different flavours of ISO 8859 that left documents from Greek computers scrambled when viewed on a French machine, because the values from 128 to 255 in the extended ASCII range were assigned differently based on the locale.
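The variable-width part is easy to see by asking for the raw bytes (a sketch assuming a runtime with TextEncoder, which is any modern browser or Node.js):

```typescript
// UTF-8 keeps ASCII at one byte per character and grows for everything else.
const bytes = (s: string) => Array.from(new TextEncoder().encode(s));

console.log(bytes("a")); // [ 97 ]            (one byte, the same value ISO 8859 uses)
console.log(bytes("ü")); // [ 195, 188 ]      (two bytes)
console.log(bytes("€")); // [ 226, 130, 172 ] (three bytes)
```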
Speaking of ASCII, the lowest of lowest common denominators: a formerly 7-bit encoding, expanded to 8 bits somewhere close to the dawn of time. With 7 bits, a whole 128 code points could be encoded, of which most English speakers would recognize the 95 printable ones; the other 33 are so-called "control characters": printer control, terminal control, a virtual bell, a backspace character, and so on.
Every "String" in every programming language in the world will accept valid ASCII.. and that somewhat long introduction lets me finally make two points:
- Strings are too liberal: a correctly implemented stack of software would happily accept a string of serialized backspace characters as a valid "name" in a form. It's all character data, after all.
- Most of the functions and methods for working with a String in most programming languages actually render our strings useless. Reversing a human name, URL, or SSN is semantically meaningless and mangles the data; uppercasing an email address (the definition of which spans two separate RFCs) is poorly defined. A couple of one-liners below make the point.
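A few TypeScript one-liners, as a sketch of how the everyday methods betray us:

```typescript
// Reversing: the combining diaeresis detaches from its "e" and dangles at the front.
const name = "Zoe\u0308"; // "Zoë" written as "e" + combining diaeresis
console.log([...name].reverse().join("")); // "\u0308eoZ", no longer anybody's name

// Uppercasing: changes both the length and the content of the string.
console.log("ß".toUpperCase()); // "SS"

// Uppercasing an email address: RFC 5321 allows the local part to be case-sensitive,
// so this may now refer to somebody else's mailbox.
console.log("User@Example.COM".toUpperCase()); // "USER@EXAMPLE.COM"
```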
Strings are, once the data is deserialized, almost never the correct data type to use, but we have neither a robust, ubiquitous library of richer types (let's not even talk about portable calling conventions across languages, before our grandfather's C enters the chat), nor a robust serialization format that we could ever dream of getting widespread adoption for.
There are precious few specifications describing what these richer, semantic types should look like.
Here are some examples:
ASCII control codes in a Name
Easily enough, someone could submit a form with their name but mistakenly paste in a 0x01 byte ("START OF HEADING"). That byte is invisible, so if it's ever interpolated into a document (e.g. into a PDF) the result is incorrect. Instead of a "String", we should use a type for human names which is defined as only comprising visible alphanumeric characters (sorry, X Æ A-12), some ligatures, and some dashes (Unicode defines around 25 dash characters, of which most people would assume 2-3 can be a valid part of a name). It becomes more complicated with non-English names, or names transliterated into their original alphabets, but the rules are the same; a sketch of such a type follows the list below.
- A "Noun" must serialize cleanly to text, and back in a way that would survive printing and scanning (for compatibility with 2000 years of human government beurocracy)
- Two nouns might compare equally even if they are in different alphabets ("Munich" and "München" are the same city in English and German, and "München" and "Muenchen" are both valid German spellings, similarly "Cologne" and "Köln" are the same, and "Hamburg" (English, German) and "Hamborg" (French) are all the same.)
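To make the idea concrete, here is a minimal TypeScript sketch of such a "Noun" type. The names (Noun, parseNoun, nounsEqual) and the tiny folding table are illustrative inventions, not a real library, and a production version would need far more care:

```typescript
// A branded type: structurally a string, but only obtainable via parseNoun().
type Noun = string & { readonly __brand: "Noun" };

// Visible letters, combining marks, spaces, apostrophes and a couple of dashes only:
// no control characters, no zero-width anythings, no digits.
const NOUN_PATTERN = /^[\p{L}\p{M}' \u002D\u2010]+$/u;

function parseNoun(input: string): Noun | null {
  const normalized = input.normalize("NFC").trim();
  return NOUN_PATTERN.test(normalized) ? (normalized as Noun) : null;
}

// Equality that tolerates diacritics, so "München" and "Muenchen" compare as equal.
// The "ue" expansion is a German-specific policy decision, encoded explicitly here.
function foldForComparison(noun: Noun): string {
  return noun
    .normalize("NFD")
    .replace(/u\u0308/g, "ue")  // ü -> ue
    .replace(/o\u0308/g, "oe")  // ö -> oe
    .replace(/a\u0308/g, "ae")  // ä -> ae
    .replace(/\p{M}/gu, "")     // strip any remaining combining marks
    .toLowerCase();
}

function nounsEqual(a: Noun, b: Noun): boolean {
  return foldForComparison(a) === foldForComparison(b);
}

// Usage:
const a = parseNoun("München");
const b = parseNoun("Muenchen");
const bad = parseNoun("Ada\u0001Lovelace"); // rejected: contains a control character
console.log(a && b && nounsEqual(a, b));    // true
console.log(bad);                           // null
```

The point of the brand is that a Noun can only come out of parseNoun, so the stray 0x01 is rejected at the form boundary rather than discovered inside a PDF later.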
Unicode Characters in a restaurant name
Unicode is amazing, but the taxonomy leaves a little to be desired
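One concrete taste of the taxonomy problem: asking how long a restaurant name is gives four different answers depending on whether you count UTF-8 bytes, UTF-16 code units, code points, or the grapheme clusters a human would count. A sketch, assuming a runtime with Intl.Segmenter and TextEncoder:

```typescript
// "Café 🇲🇽": four letters, a space, and one flag. Six things, by a human's count.
const restaurant = "Café 🇲🇽";

console.log(new TextEncoder().encode(restaurant).length); // 14  UTF-8 bytes
console.log(restaurant.length);                           // 9   UTF-16 code units
console.log([...restaurant].length);                      // 7   code points
const graphemes = new Intl.Segmenter("en", { granularity: "grapheme" });
console.log([...graphemes.segment(restaurant)].length);   // 6   grapheme clusters
```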
Phone Number as a "String"
⚠️ This has sat as a draft for a while, but I'm publishing it incomplete because my point stands, and collecting extra examples helps no one. Strings suck.