Juan Julián Merelo Guervós

Posted on Dec 17, 2017 • Edited on Jul 25, 2023

Matching things with Raku grammars

#grammars #rakulang #regexes #raku

Previously in this series, we learned how to define a grammar in Perl 6 (now Raku) and how to use it for parsing a paragraph. And I say we because that was my objective when starting this: to learn to use grammars myself so that I can put them to good use. Eventually, I would like to make a Markdown parser, if such a thing is possible, but in coding the path is more important than the destination, and I intend to take you along on this trip.

And one of the things that characterizes Markdown is matching quote-like constructs, such as the asterisks I have used in the source of this post to make quote-like stand out as italics, or the backticks I have used for quote-like. We can create a mini-grammar for (maybe) quoted words this way:

grammar Simple-Quoted {
    token TOP { ^  <quoted> $}
    token quoted { <quote>? <letters> <quote>?  } 
    token quote { "*"|"`" }
    token letters { \w+ }
}

my $quoted = "*enhanced*";
my $parsed = Simple-Quoted.parse($quoted);
say $parsed;

We always start at the TOP rule, which says that there should be a quoted word, and that's it. That's why we use ^ and $ to anchor the beginning and the end. If there's more than one word, the parse fails. The single word above does parse, and the program prints:

｢*enhanced*｣
 quoted => ｢*enhanced*｣
  quote => ｢*｣
  letters => ｢enhanced｣
  quote => ｢*｣

This is a printout of a Match-like structure that once again uses the square quotes ｢｣. It shows, first, the matched string and then a hash whose entries are the different parts, arranged as a tree, that have been destructured from the input. This is not terribly ugly, with indentation telling you a bit about the structure, but it is not all there is. We will use Brian D. Foy's PrettyDump to check it out. Let's just change the last line to

say $parsed.perl;

Which, once reformatted, looks like this:

 Match.new(
    list => (),
    made => Any,
    pos => 10,
    hash => Map.new(
    (:quoted(Match.new(
            list => (),
            made => Any,
            pos => 10,
            hash => Map.new(
                (:letters(
                 Match.new(
                     list => (),
                     made => Any,
                     pos => 9,
                     hash => Map.new(()),
                     orig => "*enhanced*",
                     from => 1)),
                 :quote(
                 [Match.new(
                     list => (),
                     made => Any,
                     pos => 1,
                     hash => Map.new(()),
                     orig => "*enhanced*",
                     from => 0),
                  Match.new(
                      list => (),
                      made => Any,
                      pos => 10,
                      hash => Map.new(()),
                      orig => "*enhanced*",
                      from => 9)]))),
            orig => "*enhanced*",
            from => 0)))),
    orig => "*enhanced*",
    from => 0)

What we see here is that grammars create a recursive set of Matches. This is essentially a hash of hashes, where each value is another Match (or an array of them), but we can also use Match methods for accessing it; there's roughly one method per key, and keys in Perl 6 are those things before the fat arrow. So

say $parsed.hash;

will print

Map.new((quoted => ｢*enhanced*｣
 quote => ｢*｣
 letters => ｢enhanced｣
 quote => ｢*｣))

But this is actually the big data structure in the first Match level. If we want to access the innermost structure, we'll have to do:

 $parsed.hash<quoted>.hash

which will return Map.new((letters => ｢enhanced｣, quote => [｢*｣｢*｣])). That's where we want to be. We have the quotes, and whatever is inside them. We can work with that.
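As a minimal sketch of working with that innermost structure, the snippet below repeats the Simple-Quoted grammar so it is self-contained and then reads individual pieces out of the inner hash. The variable name $inner is just an illustrative choice of mine; the values in the comments follow from the dump shown above.

```raku
grammar Simple-Quoted {
    token TOP     { ^ <quoted> $ }
    token quoted  { <quote>? <letters> <quote>? }
    token quote   { "*" | "`" }
    token letters { \w+ }
}

my $parsed = Simple-Quoted.parse("*enhanced*");

# The hash of the inner Match: letters plus an array with the two quotes
my $inner = $parsed.hash<quoted>.hash;

say $inner<letters>.Str;    # enhanced
say $inner<quote>.elems;    # 2, one Match per quote
say $inner<quote>[0].Str;   # *
say $inner<letters>.from;   # 1, where the letters start in the original string
say $inner<letters>.to;     # 9, the position right after the last letter
```

The from and to values are the same positions that show up as from and pos in the dump above.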

Don't worry, there's an easier way of doing that. Keep reading this series.

Mismatched matches

The witty reader will probably have noticed that mismatched quotes will also be happily parsed:

> Simple-Quoted.parse("*mismatch`");
｢*mismatch`｣
 quoted => ｢*mismatch`｣
  quote => ｢*｣
  letters => ｢mismatch｣
  quote => ｢`｣

That's not good. Not good at all. We have to change the grammar so that it actually takes into account that the quotes must be the same at the beginning and the end of the word. Let's take a hint from regular expressions and reformulate it this way:

grammar Quoted {
    token TOP { ^ [ <letters> | <quoted> ] $ }
    token quoted { (<quote>) <letters> $0  } 
    token quote { "*"|"`" }
    token letters { \w+ }
}

The main change is in the quoted token, which now captures the first quote and only matches if the same one shows up at the end. The $0 variable does just that: it holds what was captured, and will not let that kind of mismatch pass muster. TOP also accepts a bare, unquoted word now. With this grammar,

*enhanced`

will fail and return Nil (which turns into Any if you assign it to a scalar variable), that is, well, "We don't grok this, this is a bad thing". With the parentheses we capture; with $0 we match again whatever was captured before. If it's not the same thing, it fails, but if it is the same quote, it works all right.
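A quick way to convince ourselves is to run a handful of candidates through the grammar. This is only a sketch: the list of test strings is my own, and the alternation in TOP is grouped in square brackets so that both anchors apply to both branches, which is my assumption about what the grammar is meant to say.

```raku
grammar Quoted {
    # Brackets group the alternation so ^ and $ anchor both branches (assumed intent)
    token TOP     { ^ [ <letters> | <quoted> ] $ }
    # (<quote>) captures the opening quote; $0 demands the same text at the end
    token quoted  { (<quote>) <letters> $0 }
    token quote   { "*" | "`" }
    token letters { \w+ }
}

# Hypothetical inputs: two balanced, one unquoted, one mismatched
for "*enhanced*", "`code`", "plain", "*enhanced`" -> $candidate {
    my $result = Quoted.parse($candidate);
    say $candidate, " -> ", ($result ?? "matched" !! "rejected");
}
```

The first three candidates should come out as matched and the last one as rejected, since a failed parse is false in boolean context.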

More grammars

Between the two posts of this series, "The little match girl" was written for the Raku Advent Calendar, and it shows you how to create and test complex, and big, grammars. And of course, you can always check out Parsing with Perl 6 Regexes and Grammars: A Recursive Descent into Parsing, an excellent book by the very knowledgeable (and helpful) Moritz Lenz.
