A code point (sometimes written as "codepoint") is an ordinal position in the addressable encoding space.
In ASCII, code points were very straightforward because the addressable space was continuous. The binary value of a character converted to decimal was its code point. There were 128 code points defined, as you already know from previous posts. For example, a character with the binary value 01100001 is at code point 97.
$ raku -e '0b01100001.say'
97
Raku also provides a convenient method to get decimal code points:
$ raku -e '"a".ord.say'
97
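The opposite method, chr, goes from a decimal code point back to a character (I am adding it here for completeness, it is standard Raku):
$ raku -e '97.chr.say'
a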
In UTF-8 things get complicated. In the previous post about UTF-8 internal design I explained that a 1xxxxxxx starting byte is forbidden in multi-byte characters, which makes the addressable space non-continuous.
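You can see that non-continuity with a quick experiment: a byte starting with 10 cannot begin a character, so decoding it on its own should fail with a malformed UTF-8 error (the exact message depends on your Rakudo version; the byte value below is just an example I picked):
$ raku -e 'Buf.new( 0b10000101 ).decode.say'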
UTF-8 code points are usually written in hexadecimal notation, for example U+0105. Let's first learn how to convert a code point to the binary value of a character.
1. Convert hexadecimal value to bits.
$ raku -e '0x0105.base( 2 ).say'
100000101
2. Find the smallest possible character byte length that can fit this number of bits (9 in this case). Control bits do not count.
- 0xxxxxxx - this has 7 bits left, too small.
- 110xxxxx 10xxxxxx - this has 11 bits left, perfect!
3. Fill the free bits with our code point bits (100000101), starting from the right.
110xx100 10000101
4. Fill the remaining free bits with 0s.
11000100 10000101
5. Done:
11000100 10000101
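The same bit twiddling can be done in Raku. Here is a minimal sketch of steps 3-5 for the two-byte template (the shift and masks are my own, not from the post): the lowest 6 code point bits go into the 10xxxxxx byte and the remaining bits into the 110xxxxx byte.
$ raku -e 'my $cp = 0x0105; ( 0b11000000 +| ( $cp +> 6 ), 0b10000000 +| ( $cp +& 0b00111111 ) ).map( *.fmt( "%08b" ) ).say'
(11000100 10000101)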
Let's check which character U+0105 points to:
$ raku -e 'Buf.new( 0b11000100, 0b10000101 ).decode.say'
ą
And just to confirm:
$ raku -e '"ą".ord.base( 16 ).say'
105
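Encoding a character produces the same two bytes; this one-liner (my own formatting, encode defaults to UTF-8) cross-checks the manual result:
$ raku -e '"ą".encode.list.map( *.fmt( "%08b" ) ).say'
(11000100 10000101)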
The opposite conversion is straightforward - take the binary representation of a character, throw away the control bits, glue the remaining bits together, and convert them to hexadecimal.
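For our two-byte example that looks like this (a hand-written sketch; the masks simply drop the 110 and 10 control bits):
$ raku -e '( ( ( 0b11000100 +& 0b00011111 ) +< 6 ) +| ( 0b10000101 +& 0b00111111 ) ).base( 16 ).say'
105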
Coming up next: Glyphs and graphemes.