In Swift, a Character is an extended grapheme cluster, which consists of one or more Unicode scalar values. It's what a reader of a string will perceive as a single character. And a String consists of zero or more Characters.
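For instance (the strings here are just illustrative), a sketch of that distinction:

```swift
// "e" (U+0065) followed by a combining acute accent (U+0301): two Unicode
// scalar values, but a single Character (one extended grapheme cluster).
let e: Character = "e\u{301}"

let word = "caf" + String(e)        // a String is a collection of Characters
print(word)                         // café
print(word.count)                   // 4 -- Characters (graphemes)
print(word.unicodeScalars.count)    // 5 -- Unicode scalar values
```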
This is, I think, the compromise that comes closest to making sense. Check out the examples at grapheme-splitter -- I think the resulting graphemes align closely with the intuitive definition of a "character". However, think about how you would access and manipulate these graphemes programmatically: one code point at a time (or even one byte at a time). There's a disconnect between the programmer's understanding of a character and the layperson's understanding of a character. What I'm arguing is that eliminating the term "character" should eliminate that ambiguity.
The API in Swift also allows getting at a UTF-8 encoding unit, a UTF-16 encoding unit, or a UTF-32 code point, treating the string as an array of those sub-Character things, depending on what the developer is trying to do.
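Something like this (the example string is mine, not from the article) shows those views side by side:

```swift
// One String viewed through its different encodings. 😀 (U+1F600) sits
// outside the BMP, so the views report different counts.
let s = "abc\u{1F600}"              // "abc😀"

print(s.count)                      // 4 -- Characters (grapheme clusters)
print(s.unicodeScalars.count)       // 4 -- UTF-32 code points
print(s.utf16.count)                // 5 -- UTF-16 code units (😀 is a surrogate pair)
print(s.utf8.count)                 // 7 -- UTF-8 code units (😀 takes 4 bytes)

// Each view is its own collection, so you can iterate its units directly:
for byte in s.utf8 {
    print(byte, terminator: " ")    // the raw UTF-8 bytes
}
```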
Swift and Python 3 both seem to have a good handle on Unicode strings.
Alas, I have to work with C++, which has somewhat underwhelming support for Unicode.
I learnt from Golang that there is Rune.
Though, I am a little concerned about byte vs. string performance when everything you work with fits in one byte (e.g. ASCII / extended ASCII).
UTF-8 tries to straddle performance and usability. By using a variable-width encoding, you're minimising memory usage. It just means that a few of the bits in the leading byte "go unused" because they indicate the number of bytes in the multi-byte character. This is a good compromise, but it still doesn't mean that a "character" should be defined as a variable-width UTF-8 element.
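A quick sketch of that (sticking with Swift; the sample characters are arbitrary):

```swift
// Each scalar's UTF-8 byte count, plus its leading byte in binary -- the
// high bits of the leading byte signal how many bytes follow.
for ch in ["A", "\u{E9}", "\u{20AC}", "\u{1F600}"] {    // A, é, €, 😀
    let bytes = Array(ch.utf8)
    let lead = String(bytes[0], radix: 2)
    print("\(ch): \(bytes.count) byte(s), leading byte 0b\(lead)")
}
// A: 1 byte (0b1000001), é: 2 (0b11000011), €: 3 (0b11100010), 😀: 4 (0b11110000)
```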
Hmm? What does it have to do with combining characters and Zalgo?
Zalgo is just an extreme form of combining marks / joiner characters. Most people would consider
...to be a single character that just happens to have a lot of accent marks on it. If you define "character" as "grapheme", this is true. If you define "character" as "Unicode code point", it is not. That single "character" contains 34 UTF-16 elements. Try running this on CodePen and have a look at the console:
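(The CodePen snippet itself isn't reproduced here, but an equivalent check in Swift, using an illustrative combining-mark-heavy string rather than the exact character above, would be:)

```swift
// A base letter with a pile of combining marks stacked on it -- still one grapheme.
let zalgo = "Z\u{0351}\u{036B}\u{0343}\u{036A}\u{0302}\u{0334}\u{0319}\u{0324}\u{031E}\u{0349}"

print(zalgo.count)          // 1  -- a single perceived "character" (grapheme)
print(zalgo.utf16.count)    // 11 -- UTF-16 code units
```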
The problem arises because programmers' intuitive understanding of "character" tends to be closer to "code point", while the average person's understanding of "character" tends to be closer to "grapheme".
Another nice article related to the subject is mortoray.com/2013/11/27/the-string...
I found this article through the Elixir docs
@mortoray with the smart commentary, as usual 😎
It is; UTF-8 can carry up to 4 bytes of information per code point.
My point is that the terminology around what a "character" is has gotten so confusing that we should just stick to well-defined terms like "code point" and "grapheme". "Character" is sometimes confused with one or other of those (or something else entirely) and so I don't think it's a good name for a data type.
If you want to loop over "characters" in a string, you should loop over code points (each of which takes between 1 and 4 bytes in UTF-8). But why would someone ever want to loop over the individual bytes of a code point? This functionality could be provided, but not at the expense of clarity.
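In Swift, for example (the string is just illustrative), looping over code points looks like:

```swift
// Iterating code points rather than raw bytes: each loop iteration is one
// code point, no matter how many UTF-8 bytes it occupies.
let s = "na\u{EF}ve \u{1F600}"      // "naïve 😀"

for scalar in s.unicodeScalars {
    print("U+\(String(scalar.value, radix: 16, uppercase: true))")
}
```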
You're right. When UTF-16 was introduced, it was fixed-width. But -- to accommodate 4-byte-width characters -- it's now a variable-width encoding. I'll edit the text to clarify that. Thanks!
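For example (again in Swift, with arbitrary sample characters):

```swift
// A character outside the Basic Multilingual Plane needs a surrogate pair in UTF-16.
let euro = "\u{20AC}"       // € (U+20AC), inside the BMP
let emoji = "\u{1F600}"     // 😀 (U+1F600), outside the BMP

print(euro.utf16.count)     // 1 code unit  (2 bytes)
print(emoji.utf16.count)    // 2 code units (4 bytes -- a surrogate pair)
```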
Learn something new every day!