Overview
At my place of current employment I write libraries in Rust which are called by other languages, on multiple operating systems, on multiple platforms. Furthermore, it involves taking in streams of data from different communications buses and decoding it into usable forms by other developers on those other platforms, in those other languages! As a result, I've got rust projects with hundreds of structs that are exposed through a Rust FFI.
I have zero desire to maintain those translations by hand, so I wrote a tool to do it for me. There are multiple ways to go about this, I could have written a naive translator which goes line by line and and expects the rust structs to look a very specific way. That, however, would not have been maintainable, or able to be shared, so I instead used a parsing library called Lark. Lark allows me to write a context-free grammar to specify the language I'm expecting, and provide a nice interface for me to turn defined symbols into python objects which are easy to work with. Once I have those objects, building source code in other languages is fairly trivial.
You can see the whole project here: https://github.com/LivingInSyn/Rust_Api_Generator
The Context Free Grammar
I won't go into too much detail on all of the ins and outs of writing a grammar. I will mostly be focusing on the grammar that I wrote. Lark has a good tutorial here, which goes much more in depth
When writing a grammar from scratch, I find that it's best to work from the inside out. That is, start with the most basic element, and work backwards. Thing names (variables, structs, etc) are a great place to start, and that's where we'll start today. We'll call our first symbol 'name' and it looks like this:
?name: /[a-z_A-Z][a-z_A-Z0-9]*/
The question mark operator in front says, if there's only one child of this symbol, don't wrap it in another object (when it's parsed), just give me the child. The ? is followed by the name of our symbol, which in this case is simply 'name'. Next is a regular expression which defines what a name can look like. In this Case, it's an upper or lowercase A-Z, or an underscore, followed by an unlimited number of alphanumeric characters and underscores.
Next we're going to define something we call an rtype. This won't make much sense until we make it further in the grammar, but suffice to say, an rtype is defined as either a name, or an array declaration. It looks like this
?rtype: name
| array
Once we have a definition for an rtype, we can define a pointer modifier that we'll use in other declarations.
pointer: "*" mutable
| "*" const
mutable: "mut"
const: "const"
Here we defined 3 symbols, pointer, mutable and const. We made mutable and const their own symbols because we want to be able to perform different actions based on their properties. Pointer is defined by the string "*" followed by one of those two symbols, which are in turn, defined by strings
Once we have a definition for an rtype and a pointer, we can define something called a 'modified type' This tells us if we're looking at a name in another declaration if it's a pointer or not. The brackets ([]) around the pointer symbol indicate that it's optional. Thus
*mut i16
and
i16
would both be matched by the ?modifiedtype symbol.
?modifiedtype: [pointer] rtype
Now lets backtrack a moment and define 'array' because we already used it in our rtype definition. Array is defined by an opening bracket, a modifiedtype, a semicolon, one or more digits that starts with 1-9 (can't be 0). Lastly, it must end with a closing bracket.
array: "[" modifiedtype ";" /[1-9]{1}[0-9]*/ "]"
Now that we've got a good handle on what some of the context free grammar rules look like, lets skip ahead a bit and see the rest of what we need to define a struct
reprc: "#[repr(C)]"
comment: /\/\/[^\n]*/
decl: [ispub] name ":" modifiedtype [","]
| comment
?ispub: "pub"
struct: [reprc] [ispub] "struct" name "{" decl+ "}"
Here we define the repr(C) instruction to rust which tells rust to have this struct use C-compliant memory layout. We'll use this (and pub) to decide which structs to translate. If it's not repr(C) and pub, it doesn't make sense to translate it, since it won't work over an FFI interface. We declare that a decl is a combination of if it's public or not, a name, a colon, a modified type, and an optional comma. It can also be comment, since we allow comments inside of struct definitions. It's worth noting here that we intentionally designed the grammar to NOT handle multiline comments. A preprocessor will remove them later. We want single line comments to move over, NOT multiline comments.
We define a struct as possibly reprc, possibly public, the word struct, a name, an opening brace, one or more decl's and a closing brace.
Next, we're going to define a few symbols that we're not going to translate between languages. We have to define them in our grammar so that it doesn't break when we parse the file. They are 'using' statements, 'extern crate' statements and 'impl' statements. The definitions are here:
externcrate: "extern" "crate" name [optionalas] ";"
?optionalas: "as" name
usedecl: "use" namespace+ ";"
?namespace: ["::"] name
impl: ["unsafe"] "impl" name "for" name "{}"
The final step is to define a start point for the file and a top level definition. There's also some instructions for Lark, basically ignore whitespace, and handle things as strings wherever possible.
?statement: struct
| usedecl
| comment
| impl
| externcrate
?start: statement+
//for use by lark
%import common.ESCAPED_STRING
%import common.WS
%ignore WS
The full context free grammar can be found here
Using the CFG
Now that we've got a CFG, lets parse it into an abstract syntax tree. We will work on the following rust struct as a starting point:
#[repr(C)]
pub struct bar {
pub unsigned_32bit: u32,
pub arraytest: [u8;10],
pub pntr_array: [*mut c_char;4],
}
Next, we'll run the following python code. Basically, we're creating the parser, passing it in our grammar file that we made in the previous section. Then we're asking it to parse some input, and print the resulting syntax tree, in a pretty way.
from lark import Lark
gf = open("rgrammar.g", 'r')
parser = Lark(gf.read(), parser="lalr")
tree = parser.parse('''#[repr(C)]
pub struct bar {
pub unsigned_32bit: u32,
pub arraytest: [u8;10],
pub pntr_array: [*mut c_char;4],
}''')
print(tree.pretty())
running this code gives the following output:
struct
reprc
ispub
bar
decl
ispub
unsigned_32bit
u32
decl
ispub
arraytest
array
u8
10
decl
ispub
pntr_array
array
modifiedtype
pointer
mutable
c_char
4
This is our abstract syntax tree. We can see that the struct has a reprc on it, is public, is named bar, and that it has 4 decl's in it, with all of their associated fields.
Conclusion
In this article, we talked about creating the CFG and using it to create a basic abstract syntax tree using Lark. In the next section(s), we'll use this information to massage the objects created by lark into a more usable form, and then finally, output code in other languages.
Top comments (2)
I wonder whether you could use Rust's macro system for the AST creation.
I mean, it already builds an AST there, so you could use a macro that exposes a text version of the AST directly from the Rust compiler.
That way you wouldn't need to re-define the language externally.β¦
Yup, I think you could. Other people suggested this on Reddit as well. This was fun to write though, and relatively quick