Paweł bbkr Pabian

Posted on Aug 10, 2023 • Edited on May 7

UTF-8 (de)composition

#unicode #utf #raku

Composition is the process of merging a base character followed by combining code points into a single precomposed code point.

Let's start with the simple letter a:



$ raku -e '
    my $text = "a";
    $text.uniname.say;
    $text.ord.base( 16 ).say;
    $text.chars.say;
    $text.codes.say;
    $text.encode.bytes.say;
'

LATIN SMALL LETTER A # Code point name
61                   # Code point number
1                    # Single character
1                    # Single code point
1                    # Encoded in UTF-8 using single byte

Raku note: This language has no length method on strings, because in the Unicode world "length" is ambiguous. Instead, there are separate methods to ask precisely for the number of characters (chars), code points (codes), and bytes (encode.bytes).

Let's do the same for "ogonek" (tiny tail), which is a combining code point that appeared in previous posts:



$ raku -e '
    my $text = "\c[COMBINING OGONEK]";
    $text.ord.base( 16 ).say;
    $text.chars.say;
    $text.codes.say;
    $text.encode.bytes.say;
'

328 # Code point number
1   # Single character
1   # Single code point
2   # Encoded in UTF-8 using two bytes

And smash them together:



$ raku -e '
    my $text = "a\c[COMBINING OGONEK]";
    $text.say;
    $text.uniname.say;
    $text.ord.base( 16 ).say;
    $text.chars.say;
    $text.codes.say;
    $text.encode.bytes.say;
'

ą                                # Glyph
LATIN SMALL LETTER A WITH OGONEK # Code point name
105                              # Code point number
1                                # Single character
1                                # Single code point
2                                # Encoded in UTF-8 using two bytes

Our two code points U+0061 and U+0328 were composed together and produced another code point, U+0105. This is more obvious when we look at the glyphs: a + ̨ = ą.
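The same composition can be reproduced outside Raku. Below is a minimal Python sketch, assuming the standard `unicodedata` module. Unlike Raku, Python does not normalize strings automatically, so the composition step has to be requested explicitly. The variable names are illustrative only.

```python
import unicodedata

# Base letter followed by a combining code point (two code points).
decomposed = "a\u0328"  # LATIN SMALL LETTER A + COMBINING OGONEK

# NFC normalization composes them into one precomposed code point.
composed = unicodedata.normalize("NFC", decomposed)

print(len(decomposed), len(composed))   # code point counts before and after
print(unicodedata.name(composed))       # name of the composed code point
print(hex(ord(composed)))               # its code point number
print(len(composed.encode("utf-8")))    # UTF-8 size in bytes
```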


In less technical terms

Composition reflects natural language. Sometimes the base letters of a script were not enough to express the nuances of a given language. To solve that, derivatives of base letters were created by adding small modifiers that indicate differences in pronunciation, accent, tone, or stress. Those modifiers are commonly known as diacritics. The best known are acute, macron, tilde, grave, diaeresis, and ogonek.

But why did Unicode decide to provide two ways of expressing the same thing?

Compression

In the example above the base character takes 1 byte and the combining ogonek takes 2 bytes. Because the composed ą code point lives in the 2-byte range, it can be written using 2 bytes instead of 3. This quickly adds up in alphabets that use diacritics extensively, so +1 for the composed form.
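To see how the savings add up, here is a small Python sketch, assuming the standard `unicodedata` module, that encodes one Polish word in both forms and counts the UTF-8 bytes. The sample word and variable names are chosen for illustration only.

```python
import unicodedata

word = "g\u0119\u015bl\u0105"  # "gęślą": three of its five letters carry a diacritic

composed = unicodedata.normalize("NFC", word)
decomposed = unicodedata.normalize("NFD", word)

composed_bytes = len(composed.encode("utf-8"))
decomposed_bytes = len(decomposed.encode("utf-8"))

# Every letter with a diacritic costs one extra byte in decomposed form.
print(composed_bytes, decomposed_bytes)
```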

Comparison

When comparing two texts, either the composed or the decomposed form can be used, assuming of course that both texts use the same form consistently. However, a problem occurs when there is more than one combining code point, as in ǭ.



$ raku -e '
    my $text1 = "\c[LATIN SMALL LETTER O]\c[COMBINING MACRON]\c[COMBINING OGONEK]";
    my $text2 = "\c[LATIN SMALL LETTER O]\c[COMBINING OGONEK]\c[COMBINING MACRON]";
    say $text1 eq $text2;
    $text1.uniname.say;
'

True
LATIN SMALL LETTER O WITH OGONEK AND MACRON

The order of combining characters that attach to different parts of the base letter (here, above and below) is irrelevant in composition. Both texts above are equal, even though they were composed from code points in a different order. A naive code-point-by-code-point comparison of the two raw decomposed sequences would fail, so +1 for the composed form.
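In a language that does not normalize strings automatically, the difference is easy to observe. The following Python sketch, assuming the standard `unicodedata` module, compares the two raw sequences first and their composed forms second.

```python
import unicodedata

text1 = "o\u0304\u0328"  # o + COMBINING MACRON + COMBINING OGONEK
text2 = "o\u0328\u0304"  # o + COMBINING OGONEK + COMBINING MACRON

# Raw sequences differ because the combining code points are in a different order.
raw_equal = text1 == text2

# After composition both collapse to the same single code point.
nfc1 = unicodedata.normalize("NFC", text1)
nfc2 = unicodedata.normalize("NFC", text2)
nfc_equal = nfc1 == nfc2

print(raw_equal, nfc_equal)
print(unicodedata.name(nfc1))
```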

Base comparison

Skipping diacritics is very common. Most of you would type Josip Belusic into a search engine when looking for information about the Croatian inventor Josip Belušić. It becomes even more common on smartphones, where limited keyboard space and single-handed typing discourage proper use of diacritics.

Previously, s and š were completely unrelated code points, for example in the ISO-8859-2 encoding. So a lot of search engines used huge mapping dictionaries to implement "Do What I Mean" behavior and return results whether or not diacritics were used in the search query.

With Unicode, not only is it easy to get the base form of characters without any diacritic mappings:



$ raku -e '"Josip Belušić".samemark( "a" ).say'

Josip Belusic

Raku note: This counterintuitive syntax is explained here. Luckily, a friendlier and faster nomark() method will be added to Raku soon, courtesy of @lizmat.

But it is also easy to match base characters in regular expressions:



$ raku -e 'say "Josip Belušić" ~~ m:ignoremark/ Belusic /'

｢Belušić｣ # Matched part of text
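For comparison, here is roughly how the same base-character lookup might look in a language without `samemark` or `:ignoremark`. This is a minimal Python sketch, assuming the standard `unicodedata` module; `strip_marks` is a hypothetical helper name, not a library function.

```python
import unicodedata

def strip_marks(text: str) -> str:
    """Hypothetical helper: decompose, then drop all combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(
        ch for ch in decomposed
        if not unicodedata.category(ch).startswith("M")  # Mn, Mc, Me are marks
    )

name = "Josip Belu\u0161i\u0107"  # "Josip Belušić"
base = strip_marks(name)

print(base)
print("Belusic" in base)  # search with diacritics skipped
```

The sketch relies on decomposition doing the heavy lifting: once every letter is split into base plus marks, removing the marks needs no per-language mapping table.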

That gives +2 for decomposed form functionality, resulting in a tie. Both composed and decomposed forms provide nice features for people working with text, and it was a good decision to have them both in Unicode.

Stroke trap!

Unicode defines STROKE combining characters, such as COMBINING SHORT STROKE OVERLAY. But stroked letters do not decompose:



$ raku -e '"Grøn gås".samemark("a").say'  # Green goose in Danish

Grøn gas

$ raku -e '"żółw".samemark("a").say' # Turtle in Polish

zołw

$ raku -e '.say for "łø".uninames'

LATIN SMALL LETTER L WITH STROKE
LATIN SMALL LETTER O WITH STROKE

Why? I was unable to find out. They clearly have a base Latin letter. If you know, please share in the comments.
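The missing decomposition can be checked directly against the Unicode character database. The Python sketch below, assuming the standard `unicodedata` module, prints the canonical decomposition mapping of a few letters from the examples above; an empty string means the character has none.

```python
import unicodedata

letters = "\u017c\u00f3\u0142\u00f8\u00e5"  # ż ó ł ø å

# decomposition() returns the mapping as hex code points, or "" if there is none.
mappings = {ch: unicodedata.decomposition(ch) for ch in letters}

for ch, mapping in mappings.items():
    print(ch, unicodedata.name(ch), repr(mapping))
```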

More traps!

Æ does not decompose, it is simply LATIN CAPITAL LETTER AE, not A WITH E.

German ß does not decompose to SS because this transition only happens when case is changed.

Kanji does not decompose to Katakana or Hiragana, despite the fact that Katakana / Hiragana glyphs are often part of Kanji characters.

Roman numerals like Ⅳ or Ⅺ do not decompose canonically; they only have compatibility mappings, which NFD does not apply.

Trivia

In Raku you cannot switch a string between composed and decomposed forms, because all strings are automatically composed. However, there are methods to get the code point sequences of both forms:



$ raku -e '"ǭ".NFC.say; "ǭ".NFD.say;'

NFC:0x<01ed>
NFD:0x<006f 0328 0304>

If you want to find out what a string decomposes into, you can convert it back to code point names:



$ raku -e '.uniname.say for "ǭ".NFD'

LATIN SMALL LETTER O
COMBINING OGONEK
COMBINING MACRON
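The same lookup can be sketched in Python, assuming the standard `unicodedata` module: decompose the string, then ask for the name of each resulting code point.

```python
import unicodedata

# NFD fully decomposes the letter into its base and combining code points.
names = [unicodedata.name(cp) for cp in unicodedata.normalize("NFD", "\u01ed")]

for name in names:
    print(name)
```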

What happens if there is no composed code point and no glyph to represent it?

Funny stuff. Your browser or text editor will try to render it somehow. Sometimes as a character followed by the combining glyph, sometimes as an overlay.



$ raku -e '"\c[LATIN SMALL LETTER H]\c[COMBINING OGONEK]".say'

h̨ # There is no such letter in any alphabet
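That no precomposed code point exists for this pair can be confirmed with a short Python sketch, assuming the standard `unicodedata` module: composing the sequence leaves it unchanged.

```python
import unicodedata

sequence = "h\u0328"  # LATIN SMALL LETTER H + COMBINING OGONEK

# NFC has nothing to compose into, so both code points survive.
composed = unicodedata.normalize("NFC", sequence)

print(len(composed))
print([unicodedata.name(cp) for cp in composed])
```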

Is composition used only for diacritics?

No. There is a whole Inherited script category (ISO 15924 code Zinh, "Code for inherited script") with tons of weird combining characters.



$ raku -e '"\c[LATIN SMALL LETTER O]\c[COMBINING LATIN SMALL LETTER O]".say'

oͦ # Snowman?

Does decomposition work with Emoji modifiers?

Yes.



$ raku -e 'say "👍🏿" ~~ m:ignoremark/ "👍" /'

｢👍🏿｣

Coming up next: Grapheme clusters.
