DEV Community


Why No Modern Programming Language Should Have a 'Character' Data Type

Andrew (he/him) on May 27, 2020

Standards are useful. They quite literally allow us to communicate. If there were no standard grammar, no ...
Eljay-Adobe

In Swift, a Character is an extended grapheme cluster, which will consist of one-or-more Unicode scalar values. It's what a reader of a string will perceive as a single character. And a String consists of zero or more Characters.
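The same grapheme-cluster idea can be sketched in JavaScript, assuming a runtime with the ES2022 `Intl.Segmenter` API (the example string here is my own, not from the thread):

```javascript
// "é" written as 'e' + U+0301 (combining acute accent):
// two Unicode scalar values, but one perceived character
const s = "e\u0301";

// Intl.Segmenter splits a string into extended grapheme clusters
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" });
const graphemes = [...segmenter.segment(s)].map(seg => seg.segment);

console.log(s.length);          // 2 (UTF-16 code units)
console.log(graphemes.length);  // 1 (what a reader perceives)
```

So a Swift `Character` corresponds to one element of `graphemes` here, not to one element of `s`.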

Andrew (he/him)

This is, I think, the compromise that comes closest to making sense. Check out the examples at grapheme-splitter -- I think the resulting graphemes align closely with the intuitive definition of a "character". However, think about how you would access and manipulate these graphemes programmatically: one code point at a time (or even one byte at a time). There's a disconnect between the programmer's understanding of a character and the layperson's understanding of a character. What I'm arguing is that eliminating the term "character" should eliminate that ambiguity.

Eljay-Adobe

The API in Swift allows getting to a UTF-8 encoding unit, a UTF-16 encoding unit, or a UTF-32 code point, treating the string as an indexable sequence of those sub-Character things (depending on what the developer is trying to do).
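A rough JavaScript analogue of those per-encoding views, assuming a runtime with `TextEncoder` (the example string is mine, chosen so the three counts differ):

```javascript
const s = "a\u{1D11E}";  // 'a' + U+1D11E MUSICAL SYMBOL G CLEF

// UTF-8 view: the raw bytes of the UTF-8 encoding
const utf8 = new TextEncoder().encode(s);
console.log(utf8.length);    // 5 (1 byte for 'a', 4 for the clef)

// UTF-16 view: .length and .charCodeAt() count UTF-16 code units
console.log(s.length);       // 3 (the clef is a surrogate pair)

// Code-point view: the spread operator iterates code points
console.log([...s].length);  // 2
```

One string, three legitimate answers to "how long is it?" -- which is exactly why Swift makes you pick a view explicitly.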

Swift and Python 3 both seem to have a good handle on Unicode strings.

Alas, I have to work with C++, which has somewhat underwhelming support for Unicode.

Pacharapol Withayasakpunt

I learnt from Golang that there is the rune type.

Though I am a little concerned about byte vs. string performance, if everything you work with fits within 1 byte (e.g. ASCII / extended ASCII).

Andrew (he/him)

UTF-8 tries to straddle performance and usability. By using a variable-width encoding, you're minimising memory usage. It just means that a few of the bits in the leading byte "go unused" because they indicate the number of bytes in the multi-byte character. This is a good compromise, but it still doesn't mean that a "character" should be defined as a variable-width UTF-8 element.
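A quick sketch of those leading-byte bits, assuming a runtime with `TextEncoder`:

```javascript
// U+00E9 ("é") needs two bytes in UTF-8
const bytes = new TextEncoder().encode("\u00e9");

console.log(bytes.length);          // 2
console.log(bytes[0].toString(2));  // "11000011" -- the "110" prefix marks a 2-byte sequence
console.log(bytes[1].toString(2));  // "10101001" -- the "10" prefix marks a continuation byte
```

Those prefix bits are the ones that "go unused" for payload: of the 16 bits in this 2-byte sequence, only 11 carry the code point itself.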

Pacharapol Withayasakpunt

Hmm? What does it have to do with combining characters and Zalgo?

Andrew (he/him)

Zalgo is just an extreme form of combining marks / joiner characters. Most people would consider

Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦

...to be a single character that just happens to have a lot of accent marks on it. If you define "character" as "grapheme", this is true. If you define "character" as "Unicode code point", it is not. That single "character" contains 34 UTF-16 elements. Try running this on CodePen and have a look at the console:

let ii = 0
const zalgo = "Ȧ̛ͭ̔̔͑̅̈́̉͂̅̇͟͏̡͍̖̝͓̲̲͎̲̬̰̜̫̳̱̣͉͉̦"


while (true) {
  let code = zalgo.charCodeAt(ii)
  if (Number.isNaN(code)) break
  console.log(`${ii}: ${code}`)
  ii += 1
}

The problem arises because programmers' intuitive understanding of "character" tends to be closer to "code point", while the average person's understanding of "character" tends to be closer to "grapheme".

Wannes Gennar

Another nice article related to the subject is mortoray.com/2013/11/27/the-string...

I found this article through the Elixir docs

Andrew (he/him)

@mortoray with the smart commentary, as usual 😎

Andrew (he/him)

"I think 4-byte len UTF-8 is possible (not essentially max to 3 bytes)"

It is; UTF-8 can use up to 4 bytes per code point.

My point is that the terminology around what a "character" is has gotten so confusing that we should just stick to well-defined terms like "code point" and "grapheme". "Character" is sometimes confused with one or the other of those (or something else entirely), and so I don't think it's a good name for a data type.

If you want to loop over "characters" in a string, you should loop over code points (each of which is encoded in 1 to 4 bytes in UTF-8). But why would someone ever want to loop over the individual bytes of a code point? This functionality could be provided, but not at the expense of clarity.
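A minimal sketch of that code-point loop in JavaScript (the example string is my own): `for...of` iterates code points, while indexing walks UTF-16 code units and would split the emoji into two surrogate halves.

```javascript
const s = "a\u{1F600}b";  // 'a', 😀 (U+1F600), 'b'

// for...of iterates code points, not UTF-16 code units
for (const ch of s) {
  console.log(ch, ch.codePointAt(0).toString(16));
}

console.log(s.length);       // 4 (UTF-16 code units)
console.log([...s].length);  // 3 (code points)
```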

Andrew (he/him)

You're right. UTF-16's fixed-width predecessor, UCS-2, used exactly two bytes per character. But -- to accommodate characters outside the Basic Multilingual Plane -- UTF-16 is a variable-width encoding that uses surrogate pairs. I'll edit the text to clarify that. Thanks!
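A small sketch of that variable width, showing the surrogate pair JavaScript (whose strings are UTF-16) uses for a code point outside the Basic Multilingual Plane:

```javascript
// U+1D11E is outside the BMP, so UTF-16 needs two code units for it
const clef = "\u{1D11E}";

console.log(clef.length);                       // 2
console.log(clef.charCodeAt(0).toString(16));   // "d834" (high surrogate)
console.log(clef.charCodeAt(1).toString(16));   // "dd1e" (low surrogate)
console.log(clef.codePointAt(0).toString(16));  // "1d11e" (the actual code point)
```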

Merry Themes

Learn something new everyday!