Paweł bbkr Pabian

Posted on Aug 5, 2023 • Edited on Aug 30, 2023

UTF-8 code point properties

#unicode #utf #raku

Having a single encoding to express every intent in every language is awesome. But it comes at the cost of dealing with a huge number of characters you have probably never seen before. Luckily, every code point has a set of properties that may help you with text processing. There are over 100 properties in total, but the most useful are:

Major/minor category.

$ raku -e '"a".uniprop.say'
Ll

$ raku -e '"3".uniprop.say'
Nd

When uniprop is called without parameters in Raku, it returns the abbreviation of the major/minor category. Ll means letter/lowercase, Nd means number/decimal digit. The full list is available in the General_Category values table of Unicode Standard Annex #44. The major categories can also be tested on their own, as shown below.

Letter, Number, Punctuation, Separator.

Daily bread of text processing.

$ raku -e '
    my $text = "Is this 1970 Dodge?";
    say $text;
    for "Letter", "Number", "Separator", "Punctuation" {
        $text.uniprops( $_ ).join.say;
    }
'

Is this 1970 Dodge?
1101111000000111110 # Letters
0000000011110000000 # Numbers
0010000100001000000 # Separators
0000000000000000001 # Punctuation

The uniprops method is the same as uniprop, but it returns property values for all characters in the string. Both uniprop and uniprops can be given a specific property to test against.
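The major category is also the first letter of the two-letter abbreviation returned by a bare uniprop, so characters can be grouped by it. Below is a minimal sketch of that idea; the variable name %by-major is only an illustration, not something from the Raku core.

```raku
my $text = "Is this 1970 Dodge?";

# Group characters by the first letter of their category abbreviation:
# L = Letter, N = Number, Z = Separator, P = Punctuation.
my %by-major = $text.comb.classify( *.uniprop.substr( 0, 1 ) );

say %by-major<L>.join;    # letters only
say %by-major<N>.join;    # numbers only
```

This gives you the same split as the bit masks above, but as ready-to-use lists of characters.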

Script

$ raku -e '"aГΦح日".uniprops( "Script" ).say'

(Latin Cyrillic Greek Arabic Han)

Script is a writing system. It should not be confused with an alphabet: for example, a and ą are both Latin script, but ą belongs to the Polish alphabet and not to the English one. It should not be confused with a language either, despite the fact that it sometimes aligns with one, as in the case of Greek.
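One practical use of the Script property is spotting words that mix scripts, which is a common trick in look-alike domain names. The sketch below assumes a hypothetical helper named scripts-in; it skips the Common and Inherited values because those are shared between scripts.

```raku
# Hypothetical helper: list the scripts used in a word,
# ignoring Common and Inherited, which are shared between scripts.
sub scripts-in( Str $word ) {
    $word.uniprops( "Script" ).unique.grep( { $_ ne "Common" and $_ ne "Inherited" } ).List
}

say scripts-in( "paypal" );             # every letter is Latin
say scripts-in( "p\x[0430]ypal" );      # second letter is CYRILLIC SMALL LETTER A
```

A word that reports more than one script is not necessarily malicious, but it is usually worth a second look.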

How many scripts are there?

$ raku -e '.say for ( 0 .. 0x10FFFF ).map( *.uniprop( "Script" ) ).unique'

Common
Latin
Bopomofo
Inherited
Greek
Unknown
Coptic
Cyrillic
Armenian
Hebrew
Arabic
Syriac
Thaana
Nko
...

The answer is 158 at the time of writing. The number grows as new Unicode versions add scripts.

Casing

Did you know that the terms "lowercase" and "uppercase" come from the printing press? A page was composed from metal stamps. Stamps with the letters a, b, c were used more often than stamps with the letters A, B, C. So a, b, c were stored in the lower case on the desk, where they were easier to reach, while A, B, C were stored in the upper case above the desk to save space.

What do we know about the case of A?

$ raku -e 'say $_, " ", "A".uniprop( $_ ) for "Cased", "Lowercase", "Uppercase", "Lowercase_Mapping"'

Cased True
Lowercase False
Uppercase True
Lowercase_Mapping a

I won't go down the rabbit hole of titlecase vs uppercase. Foldcase, which is used for comparison, will appear in another post of this series. But Cased is an interesting property, showing whether a given code point has an uppercase/lowercase form. For example, there is no concept of letter case in Kanji:

$ raku -e '"女".uniprop( "Cased" ).say'
False

$ raku -e 'say "女".lc.ord == "女".uc.ord'
True

Numeric value

For numbers, Unicode also stores their numeric value.

$ raku -e '"4 Ⅴ ¾ 8️⃣ ㊷ 兆".uniprops( "Numeric_Value" ).grep( Int|Rat ).say'

(4 5 0.75 8 42 1000000000000)

This is super useful in text normalization. However, you must be aware of different numeric systems if you want to convert text to a numeric type in your programming language. For example, ⅤⅠ is not 51 but 6 in Roman numerals. Roman numerals are also tricky for another reason: because of their frequent use on watches and clocks in the past, there are single code points for every value from 1 to 12. So Roman 11 can be expressed by the single code point Ⅺ or by the two code points ⅩⅠ.

Just for completeness: there are no numeric values defined for constants like π or ℇ, even though ℇ (U+2107 EULER CONSTANT) is a character that has no other purpose in life than being a numeric value.
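The two spellings of Roman 11 can be compared directly. This is a small sketch using escapes so the code points are unambiguous; the array names are made up for the example.

```raku
# One code point: U+216A ROMAN NUMERAL ELEVEN.
my @single = "\x[216A]".uniprops( "Numeric_Value" );

# Two code points: U+2169 ROMAN NUMERAL TEN, U+2160 ROMAN NUMERAL ONE.
my @pair = "\x[2169]\x[2160]".uniprops( "Numeric_Value" );

say @single;
say @pair;
```

Unicode only tells you the value of each code point. Combining those values according to the rules of the numeral system is still up to you.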

Raku note: I added spaces between graphemes to increase readability, so later I had to filter out NaN values from the result by extracting only Integers and Rationals.

More Punctuation

Punctuation is huge in Unicode. There are 7 main categories for it: Connector, Dash, Open, Close, Initial, Final, Other. A full explanation is way beyond the scope of this series, but I want to show a few examples that may help you work with text right away.

Regular ASCII? Ethiopic full stop? Double question mark? Extracting sentences from text has never been so easy:

$ raku -e '"!?.።⁇".uniprops( "Sentence_Terminal" ).say'
(True True True True True)

Well, until you get to articles about J. F. Kennedy, but that will be covered in a future post about regular expressions.

Just as there are tons of sentence terminals, there are also many dashes, 29 to be precise. Now you can find them easily:

$ raku -e '( "-", "—", "⸺", "⸻" ).map( *.uniprops( "Dash" ) ).flat.say'

(True True True True)

Did you know that the hyphen ‐ should not be slapped everywhere you mean a dash? According to the rules, a compound like "semi-final" uses a hyphen, an aside like "I was — as always — hungry" uses an em dash (named after the width of the letter m), and a range like "2020–2023" uses an en dash (named after the width of the letter n). Yeah, right...

For brackets, Unicode brings one more tool to your toolbox: you can check whether a bracket is opening or closing, and you can find the matching one:
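When you search or compare text, it can be handy to fold all those dashes into one. Here is a minimal sketch that replaces every character with the Dash property by the ASCII hyphen-minus; the sample string and variable names are just for illustration.

```raku
# Sample text with U+2013 EN DASH and U+2010 HYPHEN.
my $text = "2020\x[2013]2023, semi\x[2010]final";

# Replace every character that has the Dash property with ASCII "-".
my $ascii = $text.comb.map( { .uniprop( "Dash" ) ?? "-" !! $_ } ).join;

say $ascii;
```

This loses the typographic distinction on purpose, so it is better suited for matching than for display.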

$ raku -e '.say for "(}".uniprops;'

Ps # Open_Punctuation
Pe # Close_Punctuation

$ raku -e '.say for "(}".uniprops( "Bidi_Mirroring_Glyph" );'

)
{
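These two properties are enough to sketch a bracket balance checker that does not need a hard-coded list of bracket pairs. The sub name balanced is hypothetical, and the sketch assumes that every opening bracket in the text has a Bidi_Mirroring_Glyph defined, which holds for common brackets but not for every Ps character.

```raku
# Hypothetical helper: True if every opening bracket is closed
# by its mirrored counterpart in the right order.
sub balanced( Str $text --> Bool ) {
    my @expected;    # closing brackets we are still waiting for

    for $text.comb -> $char {
        my $category = $char.uniprop;
        if $category eq "Ps" {
            # Opening bracket: remember which closing one must follow.
            @expected.push( $char.uniprop( "Bidi_Mirroring_Glyph" ) );
        }
        elsif $category eq "Pe" {
            # Closing bracket: it must match the most recent opening one.
            return False unless @expected and @expected.pop eq $char;
        }
    }

    return @expected.elems == 0;
}

say balanced( "(a[b]\{c})" );
say balanced( "(}" );
```

The same code works for any bracket pair Unicode knows about, not only for the ASCII ones.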

Can you feel the power already?

This post only scratched the surface of Unicode properties. But with this knowledge you can take any piece of text and parse it without knowing all the letters and symbols used in different scripts.

Coming up next: Composed / decomposed forms.
