Unicode
So far in the book, all examples used strings made up of ASCII characters only. However, the Regexp class uses the source encoding by default, and the default string encoding is UTF-8. See ruby-doc: Encoding for details on working with different string encodings.
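As a quick check of these defaults, the sketch below inspects a string's encoding and compares its character count with its byte count. The helper name `encoding_summary` is hypothetical, and the results assume the source file is saved as UTF-8.

```ruby
# Hypothetical helper: summarize how Ruby sees a string.
def encoding_summary(str)
  {
    encoding: str.encoding.name,   # name of the string's encoding, e.g. "UTF-8"
    chars: str.length,             # number of characters (codepoints)
    bytes: str.bytesize,           # number of bytes used in that encoding
    ascii_only: str.ascii_only?    # true if every character is ASCII
  }
end

p encoding_summary('fox')
p encoding_summary('αλεπού')
```

For ASCII-only text the character and byte counts are the same. For the Greek word the byte count is larger, because UTF-8 needs more than one byte for each of those characters.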
Encoding modifiers
Modifiers can be used to override the default encoding. For example, the n modifier will use ASCII-8BIT instead of the source encoding.
# example with ASCII characters only
>> 'foo - baz'.gsub(/\w+/n, '(\0)')
=> "(foo) - (baz)"
# example with non-ASCII characters as well
>> 'fox:αλεπού'.scan(/\w+/n)
(irb):2: warning: historical binary regexp match /.../n against UTF-8 string
=> ["fox"]
Character set escapes like \w match only ASCII characters, whereas named character sets are Unicode aware. You can also use the (?u) inline modifier to allow character set escapes to match Unicode characters.
>> 'fox:αλεπού'.scan(/\w+/)
=> ["fox"]
>> 'fox:αλεπού'.scan(/[[:word:]]+/)
=> ["fox", "αλεπού"]
>> 'fox:αλεπού'.scan(/(?u)\w+/)
=> ["fox", "αλεπού"]
See ruby-doc: Regexp Encoding for other such modifiers and details.
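To see which encoding a regexp ended up with, `Regexp#encoding` and `Regexp#fixed_encoding?` can be inspected. This is a minimal sketch, and `regexp_encoding_info` is a hypothetical helper name used only for illustration.

```ruby
# Hypothetical helper: report a regexp's encoding and whether the regexp
# is restricted to strings compatible with that encoding.
def regexp_encoding_info(re)
  { encoding: re.encoding.name, fixed: re.fixed_encoding? }
end

p regexp_encoding_info(/\w+/)    # ASCII-only pattern without an encoding modifier
p regexp_encoding_info(/\w+/u)   # the u modifier fixes the regexp to UTF-8
```

A regexp with a fixed encoding raises Encoding::CompatibilityError when it is matched against a string in an incompatible encoding, so checking `fixed_encoding?` can help when debugging such errors.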
Unicode character sets
Similar to named character classes and escape sequences, the \p{} construct offers various predefined sets that work with Unicode strings. See ruby-doc: Character Properties for the full list and details.
# extract all consecutive letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{L}+/)
=> ["fox", "αλεπού", "eagle", "αετός"]
# extract all consecutive Greek letters
>> 'fox:αλεπού,eagle:αετός'.scan(/\p{Greek}+/)
=> ["αλεπού", "αετός"]
# extract all words
>> 'φοο12,βτ_4,foo'.scan(/\p{Word}+/)
=> ["φοο12", "βτ_4", "foo"]
# delete all characters other than letters
# \p{^L} can also be used instead of \P{L}
>> 'φοο12,βτ_4,foo'.gsub(/\P{L}+/, '')
=> "φοοβτfoo"
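Script properties can also be combined to sort text by writing system. The sketch below groups runs of letters by script using \p{Latin} and \p{Greek}. The helper name `letters_by_script` is hypothetical.

```ruby
# Hypothetical helper: group consecutive letters by Unicode script.
def letters_by_script(text)
  {
    latin: text.scan(/\p{Latin}+/),   # runs of Latin-script characters
    greek: text.scan(/\p{Greek}+/)    # runs of Greek-script characters
  }
end

p letters_by_script('fox:αλεπού,eagle:αετός')
```

The punctuation between the words belongs to neither script, so it acts as a separator in both scans.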
Codepoints and Unicode escapes
For generic Unicode character ranges, specify codepoints using the \u{} construct. The snippet below also shows how to get codepoints (the numerical value of a character) in Ruby.
# to get codepoints from string
>> 'fox:αλεπού'.codepoints.map { |i| '%x' % i }
=> ["66", "6f", "78", "3a", "3b1", "3bb", "3b5", "3c0", "3bf", "3cd"]
# one or more codepoints can be specified inside \u{}
>> puts "\u{66 6f 78 3a 3b1 3bb 3b5 3c0 3bf 3cd}"
fox:αλεπού
# character range example using \u{}
# all English lowercase letters
>> 'fox:αλεπού,eagle:αετός'.scan(/[\u{61}-\u{7a}]+/)
=> ["fox", "eagle"]
See also: codepoints, a site dedicated to Unicode characters.
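The same ideas can be wrapped in small helpers. The sketch below formats codepoints in the common U+XXXX notation and uses a \u{} range covering the Greek and Coptic block (U+0370 to U+03FF). Both helper names are hypothetical.

```ruby
# Hypothetical helper: format each codepoint in U+XXXX notation.
def u_notation(str)
  str.codepoints.map { |cp| format('U+%04X', cp) }
end

# Hypothetical helper: extract runs of characters whose codepoints fall
# in the Greek and Coptic block, U+0370 to U+03FF.
def greek_block_runs(str)
  str.scan(/[\u{370}-\u{3ff}]+/)
end

p u_notation('fox')
p greek_block_runs('fox:αλεπού,eagle:αετός')
```

A codepoint range like this selects by position in the Unicode table, while \p{Greek} selects by script property, so the two are not guaranteed to match exactly the same characters.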
\X vs dot metacharacter
Some characters are made up of more than one codepoint. Unicode handles these as grapheme clusters. The dot metacharacter matches only one codepoint at a time. You can use \X to match any character, even if it consists of multiple codepoints.
>> 'g̈'.codepoints.map { |i| '%x' % i }
=> ["67", "308"]
>> puts "\u{67 308}"
g̈
>> 'cag̈ed'.sub(/a.e/, 'o')
=> "cag̈ed"
>> 'cag̈ed'.sub(/a..e/, 'o')
=> "cod"
>> 'cag̈ed'.sub(/a\Xe/, 'o')
=> "cod"
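Because \X matches one grapheme cluster at a time, it is also handy for counting or reordering user-perceived characters. This is a minimal sketch with hypothetical helper names `grapheme_count` and `grapheme_reverse`.

```ruby
# Hypothetical helper: count user-perceived characters (grapheme clusters).
def grapheme_count(str)
  str.scan(/\X/).size
end

# Hypothetical helper: reverse a string without separating a base
# character from its combining marks.
def grapheme_reverse(str)
  str.scan(/\X/).reverse.join
end

word = "ca\u{67 308}ed"        # 'g' followed by a combining diaeresis
p word.length                  # counts codepoints
p grapheme_count(word)         # counts grapheme clusters
puts grapheme_reverse(word)
```

A plain `String#reverse` works on codepoints, so it would move the combining diaeresis onto a different letter.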
Another difference is that \X matches newline characters by default, whereas the dot metacharacter needs the m modifier to do so.
>> "he\nat".sub(/e.a/, 'ea')
=> "he\nat"
>> "he\nat".sub(/e.a/m, 'ea')
=> "heat"
>> "he\nat".sub(/e\Xa/, 'ea')
=> "heat"
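The difference shows up clearly when counting matches on a string that contains a newline. The helper name `match_counts` in this sketch is hypothetical.

```ruby
# Hypothetical helper: count how many single-character matches each
# pattern finds in the given string.
def match_counts(str)
  {
    dot: str.scan(/./).size,     # dot skips newline characters
    dot_m: str.scan(/./m).size,  # m modifier lets dot match newline
    x: str.scan(/\X/).size       # \X matches newline without a modifier
  }
end

p match_counts("he\nat")
```

For this input, the plain dot finds one match fewer than the other two patterns, since it is the only one that skips the newline.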
Exercises
For practice problems, visit the Exercises.md file in this book's repository on GitHub.