Yuan Gao

Posted on Aug 22, 2021

PEG Parsers: sometimes more appropriate than Regex

#javascript #programming

Had a quick project recently, that inspired me to write a quick blog post about PEG parsers. Diving right in:

The problem/why I did this

Some friends have a little game project called Loungeware, a wario-ware collection of minigames, with contributions from the GameMaker community.

Its website needs a gallery of the games, and we wanted a way to keep this gallery up to date without someone having to manually go through the contributed games and copying over the metadata.

The data already exists in the repository in the form of code files for the game, so why can't we just process these and pull the data out for the website? That way the website can easily be kept up to date simply by reading the code that's already there! That's the basis of the problem.

How to solve this?

The game is written in GML, a C-syntax dynamic language, it shares some resemblance to Javascript. Here's what we have to extract:

As you can see, this is more or less indistinguishable from Javascript. It's really tempting to just feed this through as javascript, but that would lead to some weird code execution vulnerabilities.

So what are our options? Regex? It's the first thing that comes to mind when faced with some sort of data extraction problem. Can we just Regex this whole thing? I guess we could, but it would result in an incredibly long and complex Regex pattern.

Ok, so to reduce the complexity of a long Regex pattern, maybe we could split the task up into individual parts? Search for each occurance of microgame_register and then take the text after that and feed that through individual Regex patterns to extract each key? This would be better, it would make the Regex patterns more managable, and we can rely on the structure of the code to help us with decoding it.

Ok, so why not take this to the logical extreme? If the code is, at the end of the day, well-structured. What if we defined the rules for how the code should be put together? Let's say we defined rules like "An array starts with [ followed by some number of variables separated by commas and ending with ]"? This. This is exactly what PEG is for.

PEG.js

In past blog posts, where I've written about PEG, I've used Parsimonious in Python, such as three of my solutions to the 2020 Advent Of Code challenges (here, (here)[https://dev.to/meseta/advent-of-code-day-18-finally-using-peg-grammar-in-python-in-the-way-it-s-supposed-to-3253], and (here)[https://dev.to/meseta/advent-of-code-day-19-abusing-peg-grammar-in-python-the-way-it-s-not-supposed-to-2beg]). This time, because the rest of the website is javascript, I will be using PEG.js instead to avoid adding an extra programming language to the codebase.

PEG.js has a distinct advantage over parsimonious in that it has a nice web-based tool to help you write your grammar. I'll be using this online tool to walk you through how I went about writing a PEG grammar needed to process the above GML code into JSON.

Step 1: Whitespace

I like to go from inside->out. Take the smallest and most primitive elements and then build upwards. Since a lot of my data is in the form of numbers. I need to add PEG rules for matching and extracting them. Since unlike parsimonious which lets you use full-on regex for pattern, PEG.js only allows much simpler pattern matches, I'm going to define two rules, one for integers, and one for floats:

Number
  = Float / Integer

Float
  = "-"? ([0-9]+ "." [0-9]* / [0-9]* "." [0-9]+) { return parseFloat(text()); }

Integer
  = "-"? [0-9]+ { return parseInt(text(), 10); }

PEG matches from top down. And the text must match the first rule in its entirety. So at the moment, this PEG grammar will match a single Float or Integer. I use Javascript's handy parseInt() and parseFloat() functions to turn the captured text into an actual Javascript number.

Note: this pattern ([0-9]+ "." [0-9]* / [0-9]* "." [0-9]+) matches .0 and 0. but not .

Step 2: variable names

Some of the values in the data point to specific variables. These are easy to match, since they only allow characters a-z, A-Z, 0-9 and _, the so-called "word" characters.

Word
  = [a-zA-Z0-9_]+ { return text(); }

This is going to return the string of the variable name, which is fine by us because we don't actually need to resolve them for this use-case. If we were actually building a programming language rather than just extracting data, we'd probably at this point need to return an object representing a variable to distinguish it from a string literal. But in our case here, we are okay to treat variables like string literals.

Step 3: Booleans

We have a few booleans in our text. These are simple too, we just need to match true or false and return a javascript boolean

Boolean
  = bool:("true" / "false") { return bool === 'true' }

Step 4: String literals

String literals are much harder because we have to be able to match escaped quotes like this: "hello \"world\"" so we can't just find all the text between two double-quotes. To do this, we have to define a new rule that matches either regular characters, or specifically escaped quotes:

StringLiteral
  = str:("\"" CharDoubleQuoted* "\"") { return str[1].join(""); }

CharDoubleQuoted
  =  "\\\"" / [^"]

the str[1] is needed because we want to return the string without the quotes. and the .join("") is needed because it'll return an array of characters.

We actually have to duplicate this to support both double and single quoted characters. so the rules end up looking like this:

StringLiteral
  = str:("\"" CharDoubleQuoted* "\"" / "'" CharSingleQuoted* "'") { return str[1].join(""); }

CharDoubleQuoted
  =  "\\\"" / [^"]

CharSingleQuoted
  =  "\\'" / [^']

Step 5: Putting them together

So a value could be any one of the above rules. We can define now a rule that says "a Value can be any of these"

Value
  = Boolean / StringLiteral / Number / Word

StringLiteral
  = str:("\"" CharDoubleQuoted* "\"" / "'" CharSingleQuoted* "'") { return str[1].join(""); }

CharDoubleQuoted
  =  "\\\"" / [^"]

CharSingleQuoted
  =  "\\'" / [^']

Boolean
  = bool:("true" / "false") { return bool === 'true' }

Word
  = [a-zA-Z0-9_]+ { return text(); }

Number
  = Float / Integer

Float
  = "-"? [0-9]* "." [0-9]* { return parseFloat(text()); }

Integer
  = "-"? [0-9]+ { return parseInt(text(), 10); }

This PEG doesn't do anything particularly interesting. It'll convert numbers into actual numbers (rather than just strings of unmbers), bools into bools, correctly capture escaped strings, and turns variables into string literals. But nevertheless, we needed all this as the building blocks.

Step 6: Arrays

An array is simply any number of the above Value, surrounded by square brackets, and separated with commas. Oh, and there's a bunch of extra whitespace.

Array
  = "[" _ items:(Value _ "," _)* last:(Value) _ "]" {
      return items.map(v => v[0]).concat([last]);
    }

_ "whitespace"
  = [ \t\n\r]*

Unfortunately it's a bit harder to handle due to the fact that there's a comma after each value except for the last one. If we wrote just (Value ",")* then each value, including the last one would need a comma after it (e.g. [1,2,3,]. So we have to handle that edge-case separately with (Value ",")* Value. Incidentally, a rule like this doesn't match empty arrays, but I'm going to ignore that for now.

We can also add "Array" to our "Value" pattern to allow for nested arrays! At this point, our PEG pattern can match string, number, and boolean literals, variable names, and arrays that are made up of these things.

Step 7: Structs

In GML, Structs are a lot like javascript object notation. or Key:Value pairs surrounded by curly brackets, and separated with commas.

Struct
  = "{" _ items:(Item _ "," _)* last:(Item) _ "}" {
      return Object.fromEntries(items.map(v => v[0]).concat([last]));
    }

Item
  = key:Word _ ":" _ value:Value { return [key, value] }

Here, I'm having the Item match key:value pairs, and return an array, which Struct can turn into an Object using .fromEntries() method.

Adding this to our "Value" pattern now allows nested structs too!

Step 8: Game registration

So, we could keep going and define all the language features like function calls and algebraic expressions. But in our case here we don't need to because these files should only contain struct literals and value literals. So we're going to take a shortcut and make a rule for specifically the microgame_register() function:

Registration
  = _ "microgame_register(" _ name:StringLiteral _ "," _ config:Struct _ ")" _ ";"? _ {
      return {name: name, config: config} ;
    }

Since we did all the groundwork, that's all it takes! We know that the first argument is always a string literal, and we know the second argument is always a Struct, so we just say so.

As can be seen in the screenshot, our PEG parser is now able to parse a single invocation of microgame_register() and spit out the name and config struct as a Javascript object.

Step 9: Multiple registrations per file

The final step is that a single fine can contain multiple registrations, so all we need is a new top-level rule. The first rule in the PEG file is important, as this rule must match the whole input, so it's something of a "parent".

All
  = reg:Registration* { return reg; }

And that's it! This now lets us handle multiple "Registration" in a file.

In its entirety, the PEG grammar is:

All
  = reg:Registration* { return reg; }

Registration
  = _ "microgame_register(" _ name:StringLiteral _ "," _ config:Struct _ ")" _ ";"? _ {
        return {name: name, config: config} ;
    }

Value
  = Struct / Array /Boolean / StringLiteral / Number / Word

Struct
  = "{" _ items:(Item _ "," _)* last:(Item) _ "}" {
      return Object.fromEntries(items.map(v => v[0]).concat([last]));
    }

Item
  = key:Word _ ":" _ value:Value { return [key, value] }

Array
  = "[" _ items:(Value _ "," _)* last:(Value) _ "]" {
      return items.map(v => v[0]).concat([last]);
    }

StringLiteral
  = str:("\"" CharDoubleQuoted* "\"" / "'" CharSingleQuoted* "'") { return str[1].join(""); }

CharDoubleQuoted
  =  "\\\"" / [^"]

CharSingleQuoted
  =  "\\'" / [^']

Boolean
  = bool:("true" / "false") { return bool === 'true' }

Word
  = [a-zA-Z0-9_]+ { return text(); }

Number
  = Float / Integer

Float
  = "-"? [0-9]* "." [0-9]* { return parseFloat(text()); }

Integer
  = "-"? [0-9]+ { return parseInt(text(), 10); }

_ "whitespace"
  = [ \t\n\r]*

A set of easy-to-explain rules can come together to extract the structure of the GML code, and produce a Javascript object containing the data we want.

I hope this has been helpful in explaining a little about the process you can take to write your own PEG grammar to parse whatever it is that you needed to parse, and how PEG grammars may be an alternative to an unwieldy regex pattern.

As a rule of thumb, I suggest thinking like this: if the document you are matching has a lot of structure, like a programming language, or a data format, then PEG grammars are more appropriate and a lot more flexible than Regex, since you can make use of this structure to help you match the data. Good luck!

Cover Photo by Quaritsch Photography on Unsplash

Top comments (1)

Keff • Aug 28 '21

Nice one! Glanced over it back when you posted it but did not come back to it. Quite comprehensive, I've used PEG in the past and you break it down quite well

DEV Community