Working with matched portions
Having seen a few regexp features that can match varying text, you'll learn how to extract and work with those matching portions in this chapter. First, you'll learn about the match method and the resulting MatchData object. Then you'll learn about the scan method and how capture groups affect the scan and split methods. You'll also learn how to use global variables related to regexp.
match method
First up is the match method, which is similar to the match? method. Both methods accept a regexp and an optional index indicating where the search should start. Both also treat a string argument as if it were a regexp (which is not the case with other string methods like sub, split, etc). The match method returns a MatchData object, from which you can extract details such as the matched portion of the string and its location. nil is returned if there's no match for the given regexp.
# only the first matching portion is considered
>> 'abc ac adc abbbc'.match(/ab*c/)
=> #<MatchData "abc">
# string argument is treated the same as a regexp
>> 'abc ac adc abbbc'.match('a.*d')
=> #<MatchData "abc ac ad">
# second argument specifies starting location to search for a match
>> 'abc ac adc abbbc'.match(/ab*c/, 7)
=> #<MatchData "abbbc">
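Since match returns nil when nothing matches, calling [0] directly on the result can raise NoMethodError. Here's a minimal sketch of one way to guard against that, using the safe navigation operator. The helper name first_match is hypothetical, made up for this illustration.

```ruby
# hypothetical helper: returns the matched text, or nil when there is no match
def first_match(str, pattern)
  # &. skips the [] call when match returns nil
  str.match(pattern)&.[](0)
end

p first_match('abc ac adc abbbc', /ab+c/)   # "abc"
p first_match('abc ac adc abbbc', /xyz/)    # nil
```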
The regexp grouping inside () is also known as a capture group. It has multiple uses, one of which is the ability to work with the portions matched by those groups. When capture groups are used with the match method, they can be retrieved by indexing the MatchData object. The first element is always the entire matched portion and the rest of the elements are for the capture groups, if present. The leftmost ( gets group number 1, the second leftmost ( gets group number 2 and so on.
# retrieving entire matched portion using [0] as index
>> 'abc ac adc abbbc'.match(/a.*d/)[0]
=> "abc ac ad"
# capture group example
>> m = 'abc ac adc abbbc'.match(/a(.*)d(.*a)/)
# entire matching portion and capture group portions
>> m.to_a
=> ["abc ac adc a", "bc ac a", "c a"]
# only the capture group portions
>> m.captures
=> ["bc ac a", "c a"]
# getting a specific capture group portion
>> m[1]
=> "bc ac a"
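The numbering rule (count the opening parentheses from left to right) also applies when groups are nested. The sketch below wraps the earlier example in an extra outer group purely for illustration.

```ruby
# outer group is 1, the two inner groups are 2 and 3
m = 'abc ac adc abbbc'.match(/a((.*)d(.*a))/)

p m[0]   # "abc ac adc a"
p m[1]   # "bc ac adc a"
p m[2]   # "bc ac a"
p m[3]   # "c a"
```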
The offset method gives the starting index and the ending index plus one (that is, an exclusive end) of the matching portion. It accepts an argument to indicate the entire matching portion (0) or a specific capture group. You can also use the begin and end methods to get either of those locations.
>> m = 'awesome'.match(/w(.*)me/)
>> m.offset(0)
=> [1, 7]
>> m.offset(1)
=> [2, 5]
>> m.begin(0)
=> 1
>> m.end(1)
=> 5
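Because the second value is one past the last matched character, the pair works directly as an exclusive range on the original string. A small sketch, reusing the same input:

```ruby
s = 'awesome'
m = s.match(/w(.*)me/)

# offset returns [start, end + 1], which suits the exclusive range operator
start, stop = m.offset(1)
p s[start...stop]            # "eso"
p s[start...stop] == m[1]    # true

# length of the entire matched portion
p m.end(0) - m.begin(0)      # 6
```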
There are many more methods available. See ruby-doc: MatchData for details.
>> m = 'THIS is goodbye then'.match(/hi.*bye/i)
>> m.regexp
=> /hi.*bye/i
>> m.string
=> "THIS is goodbye then"
The named_captures method will be covered in the Named capture groups section.
match method with block
The match method also supports a block form. The block is executed only if the regexp matching succeeds.
>> 'abc ac adc abbbc'.match(/a(.*)d(.*a)/) { |m| puts m[2], m[1] }
c a
bc ac a
>> 'abc ac adc abbbc'.match(/xyz/) { 2 * 3 }
=> nil
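The value of the block becomes the return value of match, which makes it easy to supply a fallback with ||. This is only a sketch; the 'no match' fallback string is an arbitrary choice for illustration.

```ruby
# on success, the block's value is returned
res = 'abc ac adc abbbc'.match(/a(.*)d/) { |m| m[1].upcase }
p res        # "BC AC A"

# on failure, the block is skipped and nil is returned, so || kicks in
res = 'abc ac adc abbbc'.match(/xyz/) { |m| m[0] } || 'no match'
p res        # "no match"
```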
Using regexp as a string index
If you are a fan of code golfing, you can use a regexp inside [] on a string object to replicate some features of the match and sub! methods.
# same as: match(/c.*d/)[0]
>> 'abc ac adc abbbc'[/c.*d/]
=> "c ac ad"
# same as: match(/a(.*)d(.*a)/)[1]
>> 'abc ac adc abbbc'[/a(.*)d(.*a)/, 1]
=> "bc ac a"
# same as: match(/ab*c/, 7)[0]
>> 'abc ac adc abbbc'[7..][/ab*c/]
=> "abbbc"
>> word = 'elephant'
# same as: word.sub!(/e.*h/, 'w')
>> word[/e.*h/] = 'w'
=> "w"
>> word
=> "want"
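The two forms differ when the regexp doesn't match. The sketch below shows the behavior as I understand it: reading returns nil, sub! returns nil, but assigning through [] raises IndexError.

```ruby
# dup keeps the example working even if string literals are frozen
word = 'elephant'.dup

# reading with a regexp that doesn't match gives nil
p word[/xyz/]              # nil

# sub! quietly returns nil when there's nothing to replace
p word.sub!(/xyz/, 'w')    # nil

# assignment raises an exception instead
begin
  word[/xyz/] = 'w'
rescue IndexError => e
  puts "IndexError: #{e.message}"
end

p word                     # "elephant"
```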
scan method
The scan method returns all the matched portions as an array. With the match method, you can get only the first matching portion.
>> 'abc ac adc abbbc'.scan(/ab*c/)
=> ["abc", "ac", "abbbc"]
>> 'abc ac adc abbbc'.scan(/ab+c/)
=> ["abc", "abbbc"]
>> 'par spar apparent spare part'.scan(/\bs?pare?\b/)
=> ["par", "spar", "spare"]
It is a useful method for debugging purposes as well, for example to see what is going on under the hood before applying substitution methods.
>> 'that is quite a fabricated tale'.scan(/t.*a/)
=> ["that is quite a fabricated ta"]
>> 'that is quite a fabricated tale'.scan(/t.*?a/)
=> ["tha", "t is quite a", "ted ta"]
If capture groups are used, each element of the output will be an array of strings, one for each capture group. Text matched by the regexp outside of the capture groups won't be present in the output array. Also, you'll get an empty string if a particular capture group didn't match any character. See the Non-capturing groups section if you need to use groupings without affecting the scan output.
# without capture groups
>> 'abc ac adc abbc xabbbcz bbb bc abbbbbc'.scan(/ab*c/)
=> ["abc", "ac", "abbc", "abbbc", "abbbbbc"]
# with single capture group
>> 'abc ac adc abbc xabbbcz bbb bc abbbbbc'.scan(/a(b*)c/)
=> [["b"], [""], ["bb"], ["bbb"], ["bbbbb"]]
# multiple capture groups
# note that last date didn't match because there's no comma at the end
# you'll later learn better ways to match such patterns
>> '2020/04/25,1986/Mar/02,77/12/31'.scan(%r{(.*?)/(.*?)/(.*?),})
=> [["2020", "04", "25"], ["1986", "Mar", "02"]]
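Since each element is itself an array, the usual Array methods can destructure the groups. Here's a short sketch with two capture groups, using a made-up input string; note the empty strings where a group matched zero characters.

```ruby
pairs = 'xx:yyy x: x:yy :y'.scan(/(x*):(y*)/)
p pairs
# [["xx", "yyy"], ["x", ""], ["x", "yy"], ["", "y"]]

# block parameters unpack the capture groups of each match
p pairs.map { |a, b| a.size + b.size }   # [5, 1, 3, 1]
```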
Use block form to iterate over the matched portions.
>> 'abc ac adc abbbc'.scan(/ab+c/) { |m| puts m.upcase }
ABC
ABBBC
>> 'xx:yyy x: x:yy :y'.scan(/(x*):(y*)/) { |a, b| puts a.size + b.size }
5
1
3
1
split with capture groups
Capture groups affect the split method as well. If the regexp used to split contains capture groups, the portions matched by those groups will also be a part of the output array.
# without capture group
>> '31111111111251111426'.split(/1*4?2/)
=> ["3", "5", "6"]
# to include the matching portions of the regexp as well in the output
>> '31111111111251111426'.split(/(1*4?2)/)
=> ["3", "11111111112", "5", "111142", "6"]
If part of the regexp is outside a capture group, the text thus matched won't be in the output. If a capture group didn't participate, that element will be totally absent in the output.
# here 4?2 is outside capture group, so that portion won't be in output
>> '31111111111251111426'.split(/(1*)4?2/)
=> ["3", "1111111111", "5", "1111", "6"]
# multiple capture groups example
# note that the portion matched by b+ isn't present in the output
>> '3.14aabccc42'.split(/(a+)b+(c+)/)
=> ["3.14", "aa", "ccc", "42"]
# here (4)? matches zero times on the first occasion, thus absent
>> '31111111111251111426'.split(/(1*)(4)?2/)
=> ["3", "1111111111", "5", "1111", "4", "6"]
Using capture groups with the optional limit set to 2 gives behavior similar to the partition method.
# same as: partition(/a+b+c+/)
>> '3.14aabccc42abc88'.split(/(a+b+c+)/, 2)
=> ["3.14", "aabccc", "42abc88"]
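The two are similar rather than identical. As a sketch of one difference, compare the results when the regexp doesn't match at all:

```ruby
s = '3.14aabccc42abc88'

# same result when there is a match
p s.split(/(a+b+c+)/, 2) == s.partition(/a+b+c+/)   # true

# different result when there is no match
p '3.14'.split(/(a+b+c+)/, 2)     # ["3.14"]
p '3.14'.partition(/a+b+c+/)      # ["3.14", "", ""]
```

If the calling code always expects three elements, partition is the safer choice.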
regexp global variables
An expression involving regexp also sets regexp related global variables, except for the match? method. Assume m is a MatchData object in the below description of four of the regexp related global variables.
- $~ contains the MatchData object, same as m
- $` string before the matched portion, same as m.pre_match
- $& matched portion, same as m[0]
- $' string after the matched portion, same as m.post_match
Here's an example:
>> sentence = 'that is quite a fabricated tale'
>> sentence =~ /q.*b/
=> 8
>> $~
=> #<MatchData "quite a fab">
>> $~[0]
=> "quite a fab"
>> $`
=> "that is "
>> $&
=> "quite a fab"
>> $'
=> "ricated tale"
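The exception for match? can be checked with a short sketch like the one below. The strings are arbitrary; the last step also shows that a failed match resets the variables to nil.

```ruby
'that is quite' =~ /q.*t/
p $&     # "quit"

# match? succeeds here, but leaves the global variables untouched
p 'hello'.match?(/l+/)   # true
p $&     # "quit"

# a failed match with =~ resets them
'hello' =~ /xyz/
p $~     # nil
```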
For methods that match multiple times, like scan and gsub, the global variables will be updated for each match. Referring to them in later instructions will give you information only for the final match.
# same as: { |m| puts m.upcase }
>> 'abc ac adc abbbc'.scan(/ab+c/) { puts $&.upcase }
ABC
ABBBC
# using 'gsub' for illustration purpose, can also use 'scan'
>> 'abc ac adc abbbc'.gsub(/ab+c/) { puts $~.begin(0) }
0
11
# using global variables afterwards will give info only for the final match
>> $~
=> #<MatchData "abbbc">
>> $`
=> "abc ac adc "
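If you need to apply methods like map and use regexp global variables, use gsub instead of scan. Without a block or replacement argument, gsub returns an Enumerator, so the global variables are updated as each match is produced, whereas scan collects all the matches before map runs.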
>> sentence = 'that is quite a fabricated tale'
# you'll only get information for last match with 'scan'
>> sentence.scan(/t.*?a/).map { $~.begin(0) }
=> [23, 23, 23]
# 'gsub' will get you information for each match
>> sentence.gsub(/t.*?a/).map { $~.begin(0) }
=> [0, 3, 23]
In addition to using $~, you can also use $N where N is the capture group you want. $1 will have the string matched by the first group, $2 will have the string matched by the second group and so on. As a special case, $+ will have the string matched by the last group. The value is nil if that particular capture group wasn't used in the regexp.
>> sentence = 'that is quite a fabricated tale'
>> sentence =~ /a.*(q.*(f.*b).*c)(.*a)/
=> 2
>> $&
=> "at is quite a fabricated ta"
# same as $~[1]
>> $1
=> "quite a fabric"
>> $2
=> "fab"
>> $+
=> "ated ta"
>> $4
=> nil
# $~ is handy if array slicing, negative index, etc are needed
>> $~[-2]
=> "fab"
>> $~.values_at(1, 3)
=> ["quite a fabric", "ated ta"]
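If you find the punctuation-style variables hard to read, Regexp.last_match gives the same information by name. A brief sketch, reusing the example above:

```ruby
sentence = 'that is quite a fabricated tale'
sentence =~ /a.*(q.*(f.*b).*c)(.*a)/

# without an argument, Regexp.last_match gives the MatchData object, like $~
p Regexp.last_match[0] == $&    # true

# with an argument, it gives that element of the MatchData object
p Regexp.last_match(1) == $1    # true
p Regexp.last_match(2)          # "fab"
```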
Using hashes
With the help of block form and global variables, you can use a hash variable to determine the replacement string based on the matched text. If the requirement is as simple as passing the entire matched portion to the hash variable, both the sub and gsub methods accept a hash instead of a string in the replacement section.
# one to one mappings
>> h = { '1' => 'one', '2' => 'two', '4' => 'four' }
# same as: '9234012'.gsub(/1|2|4/) { h[$&] }
>> '9234012'.gsub(/1|2|4/, h)
=> "9two3four0onetwo"
# if the matched text doesn't exist as a key, default value will be used
>> h.default = 'X'
>> '9234012'.gsub(/./, h)
=> "XtwoXfourXonetwo"
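When the hash key is only part of the matched text, the block form with $1 is needed instead. The following is a sketch with a made-up input; fetch with a second argument keeps the original text for keys not present in the hash.

```ruby
h = { '1' => 'one', '2' => 'two', '4' => 'four' }

# only the character after 'a' is looked up in the hash
res = 'a1 b2 c3 a4 a7'.gsub(/a(.)/) { "a-#{h.fetch($1, $1)}" }
p res    # "a-one b2 c3 a-four a-7"
```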
For swapping two or more strings without using an intermediate result, using a hash object is recommended.
>> swap = { 'cat' => 'tiger', 'tiger' => 'cat' }
>> 'cat tiger dog tiger cat'.gsub(/cat|tiger/, swap)
=> "tiger cat dog cat tiger"
For hashes that have many entries and are likely to undergo changes during development, building the alternation list manually is not a good choice. Also, recall that as per precedence rules, the longest string should come first.
>> h = { 'hand' => 1, 'handy' => 2, 'handful' => 3, 'a^b' => 4 }
>> pat = Regexp.union(h.keys.sort_by { |w| -w.length })
>> pat
=> /handful|handy|hand|a\^b/
>> 'handful hand pin handy (a^b)'.gsub(pat, h)
=> "3 1 pin 2 (4)"
Substitution in conditional expression
The sub! and gsub! methods return nil if the substitution fails. That makes them usable as part of a conditional expression, leading to creative and terser solutions.
>> num = '4'
>> puts "#{num} apples" if num.sub!(/5/) { $&.to_i ** 2 }
=> nil
>> puts "#{num} apples" if num.sub!(/4/) { $&.to_i ** 2 }
16 apples
>> word, cnt = ['coffining', 0]
>> cnt += 1 while word.sub!(/fin/, '')
=> nil
>> [word, cnt]
=> ["cog", 2]
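This works only with the in-place methods, since sub and gsub always return a string (which is truthy). Here's a sketch that wraps the counting loop in a method; the name strip_repeatedly is hypothetical.

```ruby
# sub returns the (unchanged) string on failure, sub! returns nil
p 'coffining'.sub(/xyz/, '')        # "coffining"
p 'coffining'.dup.sub!(/xyz/, '')   # nil

# hypothetical helper: keeps deleting the pattern until it no longer matches
def strip_repeatedly(str, pattern)
  result = str.dup          # avoid modifying the caller's string
  count = 0
  count += 1 while result.sub!(pattern, '')
  [result, count]
end

p strip_repeatedly('coffining', /fin/)   # ["cog", 2]
```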
Exercises
For practice problems, visit the Exercises.md file from this book's repository on GitHub.