DEV Community


Why No Modern Programming Language Should Have a 'Character' Data Type

Andrew (he/him) on May 27, 2020

Standards are useful. They quite literally allow us to communicate. If there were no standard grammar, no ...
Eljay-Adobe

In Swift, a Character is an extended grapheme cluster, which will consist of one-or-more Unicode scalar values. It's what a reader of a string will perceive as a single character. And a String consists of zero or more Characters.
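The same grapheme-cluster idea can be sketched in JavaScript, assuming a runtime with the ES2022 `Intl.Segmenter` API (the example string here is my own, not from the thread):

```javascript
// "é" written as 'e' + U+0301 (combining acute accent):
// two Unicode scalar values, but one perceived character
const s = "e\u0301";

// Intl.Segmenter splits a string into extended grapheme clusters
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(s)].map(seg => seg.segment);

console.log(s.length);          // 2 (UTF-16 code units)
console.log(graphemes.length);  // 1 (what a reader perceives)
```

So a Swift `Character` corresponds to one element of `graphemes` here, not to one element of `s`.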

Andrew (he/him)

This is, I think, the compromise that comes closest to making sense. Check out the examples at grapheme-splitter -- I think the resulting graphemes align closely with the intuitive definition of a "character". However, think about how you would access and manipulate these graphemes programmatically: one code point at a time (or even one byte at a time). There's a disconnect between the programmer's understanding of a character and the layperson's understanding of a character. What I'm arguing is that eliminating the term "character" should eliminate that ambiguity.

Eljay-Adobe

The API in Swift allows getting to a UTF-8 encoding unit, a UTF-16 encoding unit, or a UTF-32 code point, treating the string as an indexable sequence of those sub-Character things (depending on what the developer is trying to do).
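A rough JavaScript analogue of those per-encoding views, assuming a runtime with `TextEncoder` (the example string is mine, chosen so the three counts differ):

```javascript
const s = "a\u{1D11E}";  // 'a' + U+1D11E MUSICAL SYMBOL G CLEF

// UTF-8 view: the raw bytes of the UTF-8 encoding
const utf8 = new TextEncoder().encode(s);
console.log(utf8.length);    // 5 (1 byte for 'a', 4 for the clef)

// UTF-16 view: .length and .charCodeAt() count UTF-16 code units
console.log(s.length);       // 3 (the clef is a surrogate pair)

// Code-point view: the spread operator iterates code points
console.log([...s].length);  // 2
```

One string, three legitimate answers to "how long is it?" -- which is exactly why Swift makes you pick a view explicitly.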

Swift and Python 3 both seem to have a good handle on Unicode strings.

Alas, I have to work with C++, which has somewhat underwhelming support for Unicode.

Pacharapol Withayasakpunt

I learnt from Golang that there is the rune type.

Though I am a little concerned about byte vs. string performance, if everything you work with fits within 1 byte (e.g. ASCII / extended ASCII).

Andrew (he/him)

UTF-8 tries to straddle performance and usability. By using a variable-width encoding, you're minimising memory usage. It just means that a few of the bits in the leading byte "go unused" because they indicate the number of bytes in the multi-byte character. This is a good compromise, but it still doesn't mean that a "character" should be defined as a variable-width UTF-8 element.
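A quick sketch of those leading-byte bits, assuming a runtime with `TextEncoder`:

```javascript
// U+00E9 ("é") needs two bytes in UTF-8
const bytes = new TextEncoder().encode("\u00e9");

console.log(bytes.length);          // 2
console.log(bytes[0].toString(2));  // "11000011" -- the "110" prefix marks a 2-byte sequence
console.log(bytes[1].toString(2));  // "10101001" -- the "10" prefix marks a continuation byte
```

Those prefix bits are the ones that "go unused" for payload: of the 16 bits in this 2-byte sequence, only 11 carry the code point itself.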

Pacharapol Withayasakpunt

Hmm? What does it have to do with combining characters and Zalgo?

Andrew (he/him)

Zalgo is just an extreme form of combining marks / joiner characters. Most people would consider

Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦

...to be a single character that just happens to have a lot of accent marks on it. If you define "character" as "grapheme", this is true. If you define "character" as "Unicode code point", it is not. That single "character" contains 34 UTF-16 elements. Try running this on CodePen and have a look at the console:

let ii = 0
const zalgo = "Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦"


while (true) {
  let code = zalgo.charCodeAt(ii)
  if (Number.isNaN(code)) break
  console.log(`${ii}: ${code}`)
  ii += 1
}

The problem arises because programmers' intuitive understanding of "character" tends to be closer to "code point", while the average person's understanding of "character" tends to be closer to "grapheme".

Wannes Gennar

Another nice article related to the subject is mortoray.com/2013/11/27/the-string...

I found this article through the Elixir docs

Andrew (he/him)

@mortoray with the smart commentary, as usual 😎

Andrew (he/him)

"I think 4-byte len UTF-8 is possible (not essentially max to 3 bytes)"

It is; UTF-8 can use up to 4 bytes per code point.

My point is that the terminology around what a "character" is has gotten so confusing that we should just stick to well-defined terms like "code point" and "grapheme". "Character" is sometimes confused with one or the other of those (or something else entirely), and so I don't think it's a good name for a data type.

If you want to loop over "characters" in a string, you should loop over code points (each of which is encoded in 1 to 4 bytes in UTF-8). But why would someone ever want to loop over the individual bytes of a code point? This functionality could be provided, but not at the expense of clarity.
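A minimal sketch of that code-point loop in JavaScript (the example string is my own): `for...of` iterates code points, while indexing walks UTF-16 code units and would split the emoji into two surrogate halves.

```javascript
const s = "a\u{1F600}b";  // 'a', 😀 (U+1F600), 'b'

// for...of iterates code points, not UTF-16 code units
for (const ch of s) {
  console.log(ch, ch.codePointAt(0).toString(16));
}

console.log(s.length);       // 4 (UTF-16 code units)
console.log([...s].length);  // 3 (code points)
```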

Andrew (he/him)

You're right. UTF-16's fixed-width predecessor, UCS-2, used exactly two bytes per character. But -- to accommodate characters outside the Basic Multilingual Plane -- UTF-16 is a variable-width encoding that uses surrogate pairs. I'll edit the text to clarify that. Thanks!
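A small sketch of that variable width, showing the surrogate pair JavaScript (whose strings are UTF-16) uses for a code point outside the Basic Multilingual Plane:

```javascript
// U+1D11E is outside the BMP, so UTF-16 needs two code units for it
const clef = "\u{1D11E}";

console.log(clef.length);                       // 2
console.log(clef.charCodeAt(0).toString(16));   // "d834" (high surrogate)
console.log(clef.charCodeAt(1).toString(16));   // "dd1e" (low surrogate)
console.log(clef.codePointAt(0).toString(16));  // "1d11e" (the actual code point)
```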

Merry Themes

Learn something new everyday!