A grapheme cluster is a sequence of code points that should be treated as a single unit when processed.
The most famous grapheme cluster is the CRLF line break.
$ raku -e '"\r".chars.say' # CR (carriage return)
1
$ raku -e '"\n".chars.say' # LF (line feed)
1
$ raku -e '"\r\n".chars.say' # CRLF
1 # still 1 character
$ raku -e '"\r\n".codes.say'
2
$ raku -e '"\r\n".NFC.say'
NFC:0x<000d 000a> # does not compose
Unlike composition, a grapheme cluster does not produce a new code point. The cluster has a length of 1, but the original code points remain unchanged.
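To make the contrast concrete, here is a small sketch that puts a composing sequence next to CRLF. The choice of e + COMBINING ACUTE ACCENT (U+0301) is mine, picked only because it is a well-known composing pair.

```raku
# Composition: LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT.
my $composed = "e\x[0301]";
say $composed.NFD.codes;   # 2 - decomposed form keeps both code points
say $composed.NFC.codes;   # 1 - NFC merges them into a single code point
say $composed.chars;       # 1

# Grapheme cluster: CR followed by LF.
my $cluster = "\r\n";
say $cluster.NFC.codes;    # 2 - nothing to merge, both code points survive
say $cluster.chars;        # 1
```

Both strings are one character long, but only the first one shrinks to one code point under NFC.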
Why?
The concept of grapheme clusters is used for rendering and editing purposes. For example, when your cursor is before a grapheme cluster and you press the right arrow, it should move to the end of the cluster. And notice that your text editor does just that! If you have CRLF line endings set and press the right arrow at the end of a line, the cursor goes to the beginning of the next line right away. It does not go to the beginning of the current line first (carriage return) and then down one line (line feed). You do not need to press the arrow twice, even though you jump over two code points.
The same goes for text selection: a grapheme cluster should be selected as a single unit.
Some editors are not so strict about it, though.
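The cursor behavior can be sketched in a few lines. This is only an illustration of the idea, not how any particular editor is implemented; the sample text and the @stops name are made up for the example.

```raku
my $text = "ab\r\ncd";

# Cursor stops expressed as code point offsets: start at 0, then advance
# by the size (in code points) of each grapheme cluster.
my @stops = 0;
for $text.comb -> $grapheme {
    @stops.push: @stops[*-1] + $grapheme.codes;
}
say @stops;   # [0 1 2 4 5 6]
```

There is no stop at offset 3: one press of the right arrow carries the cursor from 2 straight to 4, over both CR and LF.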
Properties
For better visualization let's use something more... visible. Like กำ - the Thai KO KAI and SARA AM characters, which also form a grapheme cluster.
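If you want to check what is inside this cluster, something like the following should do. I'm writing the string with escapes (U+0E01 and U+0E33) only so that the two code points are unambiguous; it is the same string as above.

```raku
my $thai = "\x[0E01]\x[0E33]";   # KO KAI followed by SARA AM
say $thai.chars;                 # 1
say $thai.codes;                 # 2
.say for $thai.uninames;
# THAI CHARACTER KO KAI
# THAI CHARACTER SARA AM
```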
- Code points in a cluster cannot be separated:
$ raku -e '.say for "ab".comb'
a
b
$ raku -e '.say for "กำ".comb'
กำ
Raku note: The comb function is the complement of its better-known cousin split. Instead of saying what the separator is, it says what should be extracted. Without parameters it extracts a list of characters. But we can be explicit:
$ raku -e '.say for "กำ".comb: /./'
กำ
This gives another clue: in Raku's regular expressions, which work on graphemes, a grapheme cluster is indeed matched as a single character:
$ raku -e 'say "กำ" ~~ /./'
「กำ」
- Code points in a cluster cannot be flipped:
$ raku -e '"ab".flip.say'
ba
$ raku -e '"กำ".flip.say'
กำ # same
I mention this explicitly to emphasize the difference from composition, where combining characters can be given in any order (as long as they belong to different combining classes).
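Here is a rough side-by-side sketch of that difference. The combining marks (dot below U+0323 and dot above U+0307) are my own pick for the example; they work because they belong to different combining classes.

```raku
# Composition: either input order normalizes to the same string.
my $below-first = "a\x[0323]\x[0307]";
my $above-first = "a\x[0307]\x[0323]";
say $below-first eq $above-first;   # True

# Grapheme cluster: swapping the code points by hand gives a different string.
my $thai    = "\x[0E01]\x[0E33]";        # KO KAI, then SARA AM
my $swapped = chrs($thai.ords.reverse);  # SARA AM, then KO KAI
say $thai eq $swapped;              # False
```

So flip leaves the cluster alone because reordering its code points would not give you the same text back.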
Memorization trick
If you still cannot grasp or remember the difference between composition and clusters, think of Mortal Kombat:
- Grapheme cluster = combo. Each punch and kick is visible on its own, but together they form an uninterruptible chain.
- Composition = fatality. Individual punches and kicks are not shown; the whole sequence produces a new move instead.
Coming up next: Sorting and collation.