
Paweł bbkr Pabian


UTF-8 sorting and collation

Collation is a set of rules for comparing two texts. Usually code point values are used as the default collation: A with code point 65 comes before a with code point 97:

$ raku -e 'say "A" cmp "a"'
Less

$ raku -e 'say "a" cmp "a"'
Same

$ raku -e 'say "a" cmp "A"'
More

Such a comparison function with a three-way result is a core feature of every language that implements text sorting. A sort algorithm probes different pairs of elements and relocates them in the array until every element has a Less / Same relation to the next one.

$ raku -e '( "z", "m", "a", "m" ).sort( { $^a cmp $^b } ).say' # explicit
(a m m z)

$ raku -e '( "z", "m", "a", "m" ).sort.say' # implicit
(a m m z)
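To make the "probes pairs and relocates them" idea concrete, here is a minimal sketch of an insertion sort that relies on nothing but cmp. The insertion-sort helper is hypothetical and written only for illustration; the built-in sort is more efficient and should be used in real code.

```raku
# Hypothetical helper for illustration only, not part of Raku's core.
sub insertion-sort(@items) {
    my @sorted = @items;
    for 1 ..^ @sorted.elems -> $i {
        my $j = $i;
        # Swap backwards for as long as the left neighbour compares as More.
        while $j > 0 and ( @sorted[$j - 1] cmp @sorted[$j] ) == More {
            @sorted[$j - 1, $j] = @sorted[$j, $j - 1];
            $j--;
        }
    }
    return @sorted;
}

say insertion-sort( < z m a m > );  # [a m m z]
```

When the loop ends, every neighbouring pair compares as Less or Same, which is exactly the sorted state described above.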

Very often the three-way comparison result is hidden behind shortcut operators that return a boolean result right away, for example to check if two texts are equal:

$ raku -e 'say ( "a" cmp "a" ) ~~ Same' # explicit
True

$ raku -e 'say "a" eq "a"' # short
True
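The same pattern holds for the other relations. A small sketch pairing Raku's string operators lt, gt and ne with their explicit cmp forms:

```raku
say ( "A" cmp "a" ) ~~ Less;   # explicit, prints True
say "A" lt "a";                # short, prints True

say ( "b" cmp "a" ) ~~ More;   # explicit, prints True
say "b" gt "a";                # short, prints True

say ( "a" cmp "A" ) !~~ Same;  # explicit, prints True
say "a" ne "A";                # short, prints True
```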

Collation levels

In Unicode, collation is more complex and can have up to 4 levels. The meaning of each level depends on the script. For example, in Latin script those levels are:

  • primary = alphabetic
  • secondary = diacritics
  • tertiary = casing
  • quaternary = codepoint
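One way to picture the levels is as a list of weights per character, compared level by level, where a later level is consulted only if all earlier ones tie. The sketch below is a toy model under loud assumptions: the %weights table and the toy-coll sub are made up for this illustration, and real weights come from Unicode data, not from four hand-picked characters.

```raku
# Made-up weights: ( primary, secondary, tertiary ). Illustration only.
my %weights =
    'a' => ( 1, 0, 0 ),
    'A' => ( 1, 0, 1 ),   # same letter, different case
    'ą' => ( 1, 1, 0 ),   # same letter, has a diacritic
    'b' => ( 2, 0, 0 );   # different letter

# Hypothetical comparator: the first level that differs decides.
sub toy-coll( Str $x, Str $y ) {
    for 0 .. 2 -> $level {
        my $order = %weights{$x}[$level] <=> %weights{$y}[$level];
        return $order unless $order == Same;
    }
    return $x cmp $y;     # quaternary: fall back to code points
}

say < b ą A a >.sort( &toy-coll );  # (a A ą b)
```

Note that A sorts before ą here: the secondary level (diacritics) is checked before the tertiary one (casing), so the diacritic difference outweighs the case difference.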

Raku note: Unicode collation support is built in and is controlled by the $*COLLATION dynamic variable. The order of each level can be controlled by setting it to More or Less, or the level can be ignored by setting it to Same.

Raku warning: Users must be explicit about whether they want code point collation or Unicode collation. Only dedicated routines respect $*COLLATION settings: instead of cmp there is coll, and instead of sort there is collate.
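A quick side-by-side sketch of those pairs, using a lowercase and an uppercase letter so that code point order and alphabetic order disagree (default $*COLLATION settings assumed):

```raku
say "a" cmp "Z";         # More, code point 97 is greater than 90
say "a" coll "Z";        # Less, a precedes Z alphabetically

say < a Z >.sort;        # (Z a)
say < a Z >.collate;     # (a Z)
```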

Let's jump to examples, first looking at the terrible result produced by regular code point sorting:

$ raku -e '
    ( "c", "A", "b", "Ć", "ą", "C", "a", "B" ).sort.say
'

(A B C a b c ą Ć)
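This order stops being mysterious once the code points are printed next to it. A short check with ord:

```raku
say < A B C a b c ą Ć >.map( *.ord );
# (65 66 67 97 98 99 261 262)
```

All uppercase ASCII letters have lower code points than lowercase ones, and both Polish letters live far beyond the ASCII range, so they end up last.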

Default Unicode collation produces a much more natural order:

$ raku -e '
    ( "c", "A", "b", "Ć", "ą", "C", "a", "B" ).collate.say;
'

(a A ą b B c C Ć)

Primary level

Controls alphabetic order for Latin.

$ raku -e '
    $*COLLATION.set( primary => More ); # ascending, default
    ( "a", "b", "a", "b" ).collate.say;
'

(a a b b)

$ raku -e '
    $*COLLATION.set( primary => Less ); # descending
    ( "a", "b", "a", "b" ).collate.say;
'

(b b a a)

$ raku -e '
    $*COLLATION.set( primary => Same ); # ignored
    ( "a", "b", "a", "b" ).collate.say;
'

(a a b b)

You may wonder why in the last example the result is still sorted. This is because we now have a tie: the alphabetic level is ignored, and the diacritics and casing levels are the same for all elements. So the quaternary level was used to resolve the tie.

Secondary level

Controls diacritics order for Latin.

$ raku -e '
    $*COLLATION.set( secondary => More ); # diacritics after base, default
    ( "a", "ą", "a", "ą" ).collate.say;
'

(a a ą ą)


$ raku -e '
    $*COLLATION.set( secondary => Less ); # diacritics before base
    ( "a", "ą", "a", "ą" ).collate.say;
'

(ą ą a a)

Personally I have never found controlling this level useful. Are there any alphabets that place diacritics before base characters?

Tertiary level

Controls casing order for Latin.

$ raku -e '
    $*COLLATION.set( tertiary => More ); # lowercase first, default
    ( "a", "A", "a", "A" ).collate.say;
'

(a a A A)

$ raku -e '
    $*COLLATION.set( tertiary => Less ); # uppercase first
    ( "a", "A", "a", "A" ).collate.say;
'

(A A a a)

Quaternary level

If the previous 3 levels were unable to determine the order, then code point comparison is the last resort for Latin script. To verify it, let's disable this level as well:

$ raku -e '
    $*COLLATION.set( primary => Same, quaternary => Same );
    ( "a", "b", "a", "b" ).collate.say
'

(a b a b)

As expected, the elements were returned in their original order.

Alphabet sorting

You may notice that in some cases Unicode collation does not produce the order of the alphabet/language you are using. This is because different languages may order the same script differently. For example, let's compare:

Estonia and Germany

Estonian: ABDEFGHIJKLMNOPRSŠZŽTUVÕÄÖÜ
German: AÄBCDEFGHIJKLMNOÖPQRSßTUÜVWXYZ

Unicode acknowledges those differences and provides language-specific collations alongside the default "International" one.

Raku note: There is a Language param in the $*COLLATION object, however it is not yet supported. So let's make an example using MySQL:

> CREATE TABLE collation_test (data text) Engine = InnoDB;

> INSERT INTO collation_test (data) values ("A"), ("Ä"), ("Z");

> SELECT * FROM collation_test ORDER BY data COLLATE utf8mb4_estonian_ci;
+------+
| data |
+------+
| A    |
| Z    |
| Ä    |
+------+

> SELECT * FROM collation_test ORDER BY data COLLATE utf8mb4_german2_ci;
+------+
| data |
+------+
| A    |
| Ä    |
| Z    |
+------+

Stroked letters

Luckily, Unicode collation handles stroked letters properly, despite the fact that they do not decompose (do not have a base letter), as explained in this post:

$ raku -e '( "m", "ł", "l", "n" ).collate.say'

(l ł m n) # ł is where expected, not after n code point
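For contrast, here is a small sketch of what "does not decompose" means in practice. It counts code points after canonical decomposition (NFD) and then repeats the sort by code points:

```raku
say "ą".NFD.codes;      # 2, base letter a plus a combining ogonek
say "ł".NFD.codes;      # 1, no base letter to split off

say < m ł l n >.sort;   # (l m n ł), code point 322 lands after n
```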

What was skipped?

Unicode collation is freakishly complex. Take a look at the Unicode::Collate Perl library to appreciate how deep this rabbit hole goes. I intentionally skipped CJK stuff, DUCET tables and much more that is not suitable for an "Introduction" series.

Coming up next: Fun with variables and operators (optional). Regular expressions.
