Sylver is a language-agnostic platform for building custom source code
analyzers (think eslint for every language).
This might be a lot to unpack, so let us explore this tool
by solving a real-world problem: our application's configuration is stored in
complex JSON documents, and we'd like to build a tool to automatically validate
these documents against our business rules.
In this series of tutorials, we'll go from having zero knowledge of Sylver or
static analysis to building a fully-fledged linter for our configuration files.
We will use JSON as an example, but the tools and techniques presented
apply to many data formats and even complete programming languages!
Also, note that while we will be building everything from scratch using
Sylver's domain-specific languages (DSLs), a catalog of built-in specifications for the most common languages will be included in future releases of the tool.
In part 1, we will discover Sylver's meta language, a DSL used to describe the shape
of programming languages and data formats. After completing this tutorial, we'll have
a complete spec for the JSON language, allowing us to turn JSON documents into
Sylver parse trees.In part 2, we will load the parse trees into Sylver's query engine,
and we will find nodes that violate our business rules using an SQL-like query
language.In part 3, we will learn how to turn our language spec and queries into a set of
linting rules so that we can run them conveniently and share them.
Installation
Sylver is distributed as a single static binary. Installing it is as simple as:
- Go to https://sylver.dev to download the binary for your platform
- Unpack the downloaded archive
- Move the `sylver` binary to a location in your `$PATH`
Prelude
Let us create a blank workspace for this tutorial!
We'll start by creating a new folder and a `json.syl` file to write
the Sylver specification for the JSON language.
mkdir sylver_getting_started
cd sylver_getting_started
touch json.syl
We will also store our test JSON file in `config.json`.
{
"variables": [
{
"name": "country",
"description": "Customer's country of residence",
"values": ["us", "fr", "it"]
},
{
"name": "age",
"description": "Customer's age",
"type": "number"
}
]
}
This file specifies the variables used to describe customer profiles in a fictional
customer database.
Variables are assigned a type or a list of potential values.
Type definitions
Sylver parses raw text into typed structures called parse trees. The first step in defining a language spec is to define a set of types for our tree nodes.
We'll add the following node type declarations to our `json.syl` spec:
node JsonNode { }
node Null: JsonNode { }
node Bool: JsonNode { }
node Number: JsonNode { }
node String: JsonNode { }
node Array: JsonNode {
elems: List<JsonNode>
}
node Object: JsonNode {
members: List<Member>
}
node Member: JsonNode {
key: String,
value: JsonNode
}
These declarations resemble object type declarations in many mainstream languages.
The `:` syntax denotes inheritance.
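Before moving on, it may help to see this hierarchy in more familiar terms. The following Python sketch (purely illustrative, not part of Sylver) models the same node types as dataclasses. The `text` field on the leaf nodes is our own addition, standing in for the token text that Sylver tracks implicitly:

```python
# Hypothetical Python model of the Sylver node hierarchy for JSON.
from dataclasses import dataclass
from typing import List

@dataclass
class JsonNode:
    pass

@dataclass
class Null(JsonNode):
    pass

@dataclass
class Bool(JsonNode):
    text: str  # not in the Sylver spec: holds the matched token text

@dataclass
class Number(JsonNode):
    text: str

@dataclass
class String(JsonNode):
    text: str

@dataclass
class Member(JsonNode):
    key: String      # `key: String` in the Sylver spec
    value: JsonNode  # `value: JsonNode` in the Sylver spec

@dataclass
class Array(JsonNode):
    elems: List[JsonNode]

@dataclass
class Object(JsonNode):
    members: List[Member]

# The parse tree for `{"type": "number"}`, built by hand:
tree = Object(members=[Member(key=String('"type"'), value=String('"number"'))])
```

Because `Member` inherits from `JsonNode`, any node can appear wherever a `JsonNode` is expected, which is exactly how the Sylver spec lets arrays and object values hold arbitrary JSON nodes.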
Now that we have a set of types to describe JSON documents, we need to specify how to build a
parse tree from a sequence of characters. This process is done in two steps:
- Lexical analysis: in this step, individual characters that form an indivisible entity (such as the digits of a number or the characters of a string) are grouped together into tokens. Some tokens are only one character wide (for example, the braces, brackets, commas, and colons in JSON).
- Syntactic analysis, in which tree nodes are built from the stream of tokens.
Lexical analysis
Tokens are described using declarations of the form `term NAME = <term_content>`,
where `<term_content>` is either a literal surrounded by single quotes (`'`) or a regex
between backticks (`` ` ``). The regexes use a syntax similar to Perl-style regular expressions.
Characters in the input string that match one of the terminal literals
or regexes will be grouped into a token of the given name.
term COMMA = ','
term COLON = ':'
term L_BRACE = '{'
term R_BRACE = '}'
term L_BRACKET = '['
term R_BRACKET = ']'
term NULL = 'null'
term BOOL_LIT = `true|false`
term NUMBER_LIT = `\-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?`
term STRING_LIT = `"([^"\\]|(\\[\\/bnfrt"])|(\\u[a-fA-F0-9]{4}))*"`
ignore term WHITESPACE = `\s`
The term rules for numbers and strings are slightly involved, to account for some
of JSON's peculiarities.
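As a quick sanity check (using Python's `re` module, not Sylver itself), we can confirm that the `NUMBER_LIT` regex accepts the notations JSON allows and rejects those it forbids:

```python
import re

# The NUMBER_LIT regex from the spec, translated to Python syntax.
NUMBER_LIT = re.compile(r"-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?")

def is_number(s: str) -> bool:
    # fullmatch: the whole string must form a single number token
    return NUMBER_LIT.fullmatch(s) is not None

print(is_number("-12.5e+3"))  # True: sign, fraction, and exponent are allowed
print(is_number("01"))        # False: JSON forbids leading zeros
print(is_number("1."))        # False: a decimal point requires trailing digits
```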
Note that the `WHITESPACE` term declaration (matching a single whitespace character)
is prefixed with the `ignore` keyword. This means that `WHITESPACE` tokens do not affect
the structure of the document and can be ignored during syntactic analysis.
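To make the lexing step concrete, here is a minimal tokenizer sketch in Python. It is an illustration of the idea, not how Sylver is implemented: it tries each term pattern in turn at the current position and drops `WHITESPACE` tokens, mimicking `ignore term`:

```python
import re

# (name, pattern) pairs mirroring the term declarations above.
TERMS = [
    ("WHITESPACE", r"\s"),
    ("COMMA", r","), ("COLON", r":"),
    ("L_BRACE", r"\{"), ("R_BRACE", r"\}"),
    ("L_BRACKET", r"\["), ("R_BRACKET", r"\]"),
    ("NULL", r"null"),
    ("BOOL_LIT", r"true|false"),
    ("NUMBER_LIT", r"-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?"),
    ("STRING_LIT", r'"([^"\\]|(\\[\\/bnfrt"])|(\\u[a-fA-F0-9]{4}))*"'),
]

def tokenize(text):
    tokens, pos = [], 0
    while pos < len(text):
        for name, pattern in TERMS:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "WHITESPACE":  # `ignore term`: match but drop
                    tokens.append((name, m.group()))
                pos += m.end()
                break
        else:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
    return tokens

print(tokenize('{"age": 42}'))
```

Running this on `{"age": 42}` produces the token stream `L_BRACE`, `STRING_LIT`, `COLON`, `NUMBER_LIT`, `R_BRACE`; the whitespace between the colon and the number is matched but discarded.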
Syntactic analysis
In this last part of the language spec, we write rules describing how tree nodes are built by matching tokens from the input stream.
For example, a rule specifying: "if the current token is a STRING_LIT, build a
String node" can be written as follows:
rule string = String { STRING_LIT }
Rules can refer to other rules to construct nested nodes.
For example, here is a rule specifying that a `Member` node (corresponding to an object member
in JSON) can be built by parsing a node using the `string` rule, then matching a `COLON`
token followed by any valid JSON value:
rule member = Member { key@string COLON value@main }
Nested nodes are associated with a field using the `@` syntax.
The `main` rule is the entry point for the parser, so in our case, it designates any valid JSON value.
A valid JSON document can be made of a 'null' literal, a number, a boolean value,
a string, an array of JSON values, or a JSON object, which is reflected in the main rule:
rule main =
Null { NULL }
| Number { NUMBER_LIT }
| Bool { BOOL_LIT }
| string
| Array { L_BRACKET elems@sepBy(COMMA, main) R_BRACKET }
| Object { L_BRACE members@sepBy(COMMA, member) R_BRACE }
The `sepBy(TOKEN, rule_name)` syntax parses a list of nodes using the given rule,
while matching a `TOKEN` token between consecutive nodes.
Conclusion
We now have a complete language spec for the JSON language:
node JsonNode { }
node Null: JsonNode { }
node Bool: JsonNode { }
node Number: JsonNode { }
node String: JsonNode { }
node Array: JsonNode {
elems: List<JsonNode>
}
node Object: JsonNode {
members: List<Member>
}
node Member: JsonNode {
key: String,
value: JsonNode
}
term COMMA = ','
term COLON = ':'
term L_BRACE = '{'
term R_BRACE = '}'
term L_BRACKET = '['
term R_BRACKET = ']'
term NULL = 'null'
term BOOL_LIT = `true|false`
term NUMBER_LIT = `\-?(0|([1-9][0-9]*))(\.[0-9]+)?((e|E)(\+|-)?[0-9]+)?`
term STRING_LIT = `"([^"\\]|(\\[\\/bnfrt"])|(\\u[a-fA-F0-9]{4}))*"`
ignore term WHITESPACE = `\s`
rule string = String { STRING_LIT }
rule member = Member { key@string COLON value@main }
rule main =
Null { NULL }
| Number { NUMBER_LIT }
| Bool { BOOL_LIT }
| string
| Array { L_BRACKET elems@sepBy(COMMA, main) R_BRACKET }
| Object { L_BRACE members@sepBy(COMMA, member) R_BRACE }
The last step is to test it on our test file!
This is done by invoking the following command:
sylver parse --spec=json.syl --file=config.json
This command yields the following parse tree:
Object {
. ● members: List<Member> {
. . Member {
. . . ● key: String { "variables" }
. . . ● value: Array {
. . . . ● elems: List<JsonNode> {
. . . . . Object {
. . . . . . ● members: List<Member> {
. . . . . . . Member {
. . . . . . . . ● key: String { "name" }
. . . . . . . . ● value: String { "country" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "description" }
. . . . . . . . ● value: String { "Customer's country of residence" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "values" }
. . . . . . . . ● value: Array {
. . . . . . . . . ● elems: List<JsonNode> {
. . . . . . . . . . String { "us" }
. . . . . . . . . . String { "fr" }
. . . . . . . . . . String { "it" }
. . . . . . . . . }
. . . . . . . . }
. . . . . . . }
. . . . . . }
. . . . . }
. . . . . Object {
. . . . . . ● members: List<Member> {
. . . . . . . Member {
. . . . . . . . ● key: String { "name" }
. . . . . . . . ● value: String { "age" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "description" }
. . . . . . . . ● value: String { "Customer's age" }
. . . . . . . }
. . . . . . . Member {
. . . . . . . . ● key: String { "type" }
. . . . . . . . ● value: String { "number" }
. . . . . . . }
. . . . . . }
. . . . . }
. . . . }
. . . }
. . }
. }
}
In the next part, we'll define business rules to validate our JSON
configuration (for example, the possible values for each variable must be
of the same type), and we will use a query DSL to identify the tree nodes that violate these rules.