Paweł bbkr Pabian

UTF-8 grapheme clusters

A grapheme cluster is a sequence of code points that should be treated as a single unit when processed.

The most famous grapheme cluster is the CRLF line break.



$ raku -e '"\r".chars.say' # CR (carriage return)
1

$ raku -e '"\n".chars.say' # LF (line feed)
1

$ raku -e '"\r\n".chars.say' # CRLF
1 # still 1 character

$ raku -e '"\r\n".codes.say'
2

$ raku -e '"\r\n".NFC.say'
NFC:0x<000d 000a> # does not compose



Unlike composition, a grapheme cluster does not turn its sequence of code points into another code point. The cluster has a length of 1, but the original code points remain unchanged.

Why?

The concept of grapheme clusters is used for rendering and editing purposes. For example, when your cursor is before a grapheme cluster and you press the right arrow, it should move to the end of the grapheme cluster. And notice that your text editor does just that! If you have CRLF line endings set and press the right arrow at the end of a line, the cursor goes to the beginning of the next line right away. It does not go to the beginning of the current line first (carriage return) and then down one line (line feed). One key press is enough, even though you jump over two code points.

The same goes for text selection: a grapheme cluster should be selected as a single unit, although some editors are not so strict about it.
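The cursor behavior described above can be sketched in Raku. This is only an illustration, assuming a hypothetical editor that treats every character returned by comb as one cursor stop. The variable names are made up for the example.

```raku
# A short text with a CRLF line break in the middle.
my $text = "ab\r\ncd";

# Without arguments, comb extracts characters (grapheme clusters),
# so each element is one stop for the hypothetical cursor.
my @stops = $text.comb;

say @stops.elems;   # 5 stops: a, b, CRLF, c, d
say $text.codes;    # 6 code points underneath
```

The cursor needs five key presses to cross six code points, because CRLF counts as a single stop.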

Properties

For better visualization let's use something more... visible. Like กำ, made of the Thai characters KO KAI and SARA AM, which also form a grapheme cluster.
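To check what is inside this cluster, a quick sketch using the same chars and codes methods as for CRLF, plus uninames to list the Unicode names of the code points (the variable name is just for illustration):

```raku
my $cluster = "กำ";

say $cluster.chars;   # 1 character (grapheme cluster)
say $cluster.codes;   # 2 code points

# Print the Unicode name of each code point in the cluster.
.say for $cluster.uninames;
```

The names should come out as THAI CHARACTER KO KAI and THAI CHARACTER SARA AM, in that order.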

  • Code points in a cluster cannot be separated:


$ raku -e '.say for "ab".comb'
a
b

$ raku -e '.say for "กำ".comb'
กำ



Raku note: The comb function is the complement of its better-known cousin split. Instead of saying what the separator is, you say what should be extracted. Without arguments it extracts a sequence of characters. But we can be explicit:



$ raku -e '.say for "กำ".comb: /./'
กำ



This gives another clue: in grapheme-aware regular expressions, such as Raku's, a grapheme cluster is indeed matched as a single character:



$ raku -e 'say "กำ" ~~ /./'
「กำ」


  • Code points in a cluster cannot be flipped:


$ raku -e '"ab".flip.say'
ba

$ raku -e '"กำ".flip.say'
กำ  # same



I mention this explicitly to emphasize the difference from composition, where combining characters of different combining classes can be given in any order.
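For contrast, here is a small sketch of that order independence in composition. It assumes Raku's default behavior of normalizing strings when they are created. The two combining marks, acute accent (U+0301) and dot below (U+0323), are picked only as an example of marks with different combining classes.

```raku
# e + COMBINING ACUTE ACCENT + COMBINING DOT BELOW
my $acute-first = "e\x[0301]\x[0323]";

# The same marks given in the opposite order.
my $dot-first = "e\x[0323]\x[0301]";

# Both strings normalize to the same character.
say $acute-first eq $dot-first;   # True
say $acute-first.chars;           # 1

# The underlying code points are the same list for both strings.
say $acute-first.ords;
say $dot-first.ords;
```

With กำ there is no such freedom: KO KAI followed by SARA AM is the only order that forms this cluster.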

Memorization trick

If you still cannot grasp or remember the difference between composition and clusters, think of Mortal Kombat:

  • Grapheme cluster = combo. Each punch and kick is visible on its own, but together they form an uninterruptible chain.
  • Composition = fatality. Individual punches and kicks are not shown; the whole sequence produces a new move instead.

Coming up next: Sorting and collation.