Today I needed to match some unicode chars in an Elixir regex.
TL;DR:
Use the u modifier and \x{...}, e.g. ~r/\x{1234}/u
Matching a unicode char in a regex
More specifically, I needed to remove all zero width chars from a string.
These are U+200B, U+200C, U+200D and U+FEFF.
Trying to use \u does not work:
iex(1)> ~r/\u200B/
** (Regex.CompileError) PCRE does not support \L, \l, \N{name}, \U, or \u at position 1
(elixir) lib/regex.ex:209: Regex.compile!/2
(elixir) expanding macro: Kernel.sigil_r/2
iex:1: (file)
Looking at the docs, it seems that \x{} is the way to go, but no:
iex(1)> ~r/\x{200B}/
** (Regex.CompileError) character value in \x{} or \o{} is too large at position 7
(elixir) lib/regex.ex:209: Regex.compile!/2
(elixir) expanding macro: Kernel.sigil_r/2
iex:1: (file)
The trick is that we need to apply the unicode (u) modifier to the regex, telling the regex compiler that we're working in Unicode:
iex(1)> ~r/\x{200B}/u
~r/\x{200B}/u
iex(2)> "Hello,\u200BWorld!" |> String.replace(~r/\x{200B}/u, "")
"Hello,World!"
Yay!
So my final regex could be something like:
~r/\x{200B}|\x{200C}|\x{200D}|\x{FEFF}/u
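The same four code points can also be written as a character class, which may read a little tighter than the alternation. Here is a minimal sketch, assuming a hypothetical helper module named ZeroWidth (the module and function names are mine, not part of any library):

```elixir
defmodule ZeroWidth do
  # Hypothetical helper; the name is for illustration only.
  # The class covers U+200B through U+200D plus U+FEFF.
  # The u modifier is still required for \x{...} values above 0xFF.
  def strip(string) when is_binary(string) do
    String.replace(string, ~r/[\x{200B}-\x{200D}\x{FEFF}]/u, "")
  end
end

ZeroWidth.strip("He\u200Bllo,\u200C Wor\uFEFFld!")
# => "Hello, World!"
```

The regex is kept inside the function body rather than in a module attribute, which sidesteps any questions about storing compiled regexes at compile time.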
Interpolation works too.
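We can also interpolate strings into a regex. The \u200B escape is resolved by the string literal, so the regex receives the character's raw UTF-8 bytes and matches them literally, even without the u modifier: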
iex(5)> "Hello,\u200BWorld!" |> String.replace(~r/#{"\u200B"}/, "")
"Hello,World!"