This is my final report for the Google Summer of Code 2020.
What is the software?
The software is called Caribay and it is a PEG (Parsing Expression Grammar) parser generator built with LPegLabel, with support for the automatic generation of error labels and error recovery rules. The generated parser captures a generic AST (Abstract Syntax Tree) or a list of thrown errors. Caribay makes it easier to parse lexical symbols, comments, identifiers and keywords using its own syntax. The source code delivered for GSoC 2020 can be found here on the branch final-report-gsoc2020, which is already merged into master.
Basic usage
You need to require the module caribay.generator:
local generator = require"caribay.generator"
then you call the gen function, passing a PEG grammar as an argument, to generate an LPegLabel parser:
local src = [[
assign <- ID '=' number
fragment number <- FLOAT / INT
INT <- %d+
FLOAT <- %d+ '.' %d+
]]
local match = generator.gen(src)
match[[ a_2 = 3.1416 ]]
Keep reading to learn more about the other features or read the documentation.
What I did during GSoC:
The project consisted of developing a parser for the input grammar, a preprocessor for computing some sets (e.g. FIRST and FOLLOW), the implementation of an algorithm (with optimizations optionally enabled by the user) for automatically generating the error labels, and a translator to LPegLabel patterns.
The parser
The parser for the input grammars was also built with LPegLabel. The syntax for the grammars is very similar to that of the re module, but with some new syntax and semantics:
Lexical and Syntactic symbols
Caribay differentiates lexical symbols from syntactic symbols by writing them as UPPER_CASE symbols and snake_case symbols, respectively. The difference is that lexical symbols capture all (and only) the text they match as a new AST node, while syntactic symbols capture a new AST node containing an array of all their children nodes.
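For instance, given the following grammar (my own sketch, reusing the rule style from this report):

```
assign <- ID '=' INT
INT <- %d+
```

matching the input x = 42 would capture an AST roughly like this (pos fields omitted; the exact node layout may differ slightly):

```
{
    tag = 'assign',
    { tag = 'ID', 'x' },
    { tag = 'INT', '42' }
}
```

The lexical symbols ID and INT each capture exactly the text they matched, while the syntactic symbol assign captures a node whose array part holds its children.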
Predefined symbols
Caribay provides some useful predefined symbols, which are overridden if the user defines them:
SKIP
It is used to skip a pattern between lexical symbols. If the user defines a COMMENT rule, it becomes part of the ordered choice of SKIP.
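As a sketch, assuming the re-like syntax shown elsewhere in this report (with %nl being the end-of-line pattern), a COMMENT rule for Lua-style line comments could look like this, and Caribay would then skip such comments between tokens:

```
assign <- ID '=' INT
INT <- %d+
COMMENT <- '--' (!%nl .)*
```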
ID
By default, the ID rule is defined by this grammar:
ID <- ID_START ID_END?
ID_START <- [a-zA-Z]
ID_END <- [a-zA-Z0-9_]+
Users can define their own ID_START and ID_END rules.
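For example, a user could restrict identifiers to start with a lowercase letter or an underscore by redefining the rules (a hypothetical sketch):

```
ID_START <- [a-z_]
ID_END <- [a-zA-Z0-9_]+
```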
Literals
Regular literals
To match a single literal, the user writes the following grammar:
s <- 'this is a literal'
Captured literals
To also capture the literal's text in a syntactic rule, the user writes it with double quotes:
s <- "a"
The AST captured when matching the input a is:
{
tag = 's', pos = 1,
{ tag = 'token', pos = 1, 'a' }
}
Keywords
Keywords, which are surrounded by backticks, are a special type of literal: Caribay captures them (when they are used in syntactic rules) and wraps them in code that ensures they are not confused with identifiers. Basically, when matching a keyword kw, Caribay actually matches:
`kw`!ID_END
and when matching an identifier, Caribay checks that it is not a keyword defined in the grammar.
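A small sketch of the effect (the rule names here are my own invention): in the grammar below, `while` is a keyword, so the input whiles is matched as a single identifier rather than as the keyword while followed by s, thanks to the implicit !ID_END check.

```
s <- (loop / call)+
loop <- `while` call
call <- ID
```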
Other features
There are also fragments, semantic actions, named groups, skippable nodes, error labels, and other features. For reading more about them see the documentation.
The preprocessor
In the code, this module is called the annotator. It computes the well-known FIRST and FOLLOW sets for the symbols, and also what I call the LAST and CONTEXT sets:
- LAST: It is the set of possible last tokens or "tails" of the pattern. It is like the opposite of the FIRST set.
- CONTEXT: It is the set of tokens that can come before the pattern. It is like the opposite of the FOLLOW set.
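As a small worked illustration (my own, not taken from the project's documentation), consider the grammar from the basic-usage example:

```
assign <- ID '=' number
fragment number <- FLOAT / INT
INT <- %d+
FLOAT <- %d+ '.' %d+
```

Roughly speaking: FIRST(number) and LAST(number) are both {FLOAT, INT}, FOLLOW(number) contains only the end of the input, and CONTEXT(number) is {'='}, since '=' is the only token that can appear immediately before number.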
The algorithm
The implemented algorithm is called the Unique Algorithm and it is based on the research work of my mentor, Sergio Medeiros. Basically, it finds safe places in the grammar to automatically insert error labels and error recovery rules, using the FIRST and FOLLOW sets.
An optional optimization for increasing the number of automatically inserted error labels, which I called the Unique Context Optimization and which is also based on Sergio's research work, was implemented using the LAST and CONTEXT sets.
The translator
When running the Unique Algorithm, Caribay translates the grammar into an LPegLabel grammar, generating LPegLabel patterns from the AST nodes returned by the parser.
What is missing
There are two pending features:
- A way of printing the original grammar but with the new error labels.
- Another optimization, called Unique Syntactical Symbols, for increasing the number of automatically generated error labels.
Personal Retrospective
First of all, I really enjoyed working on this super-computer-scientific project alongside Sergio, who was very helpful and polite, and was active in answering my questions and suggesting features.
This was my first time using my knowledge of programming language theory in a project outside my academic work, and even though this was a summer job, working with Sergio on this project felt as exciting and rewarding as my favorite academic courses and projects from my university.
Acknowledgements
Thanks to Sergio Medeiros for being such a good mentor, to LabLua for hosting this project, and to my buddy Germán Robayo for encouraging me to participate in the GSoC (please read his blog!).