DEV Community

Jin
Jin

Posted on • Edited on • Originally published at github.com

Tree - AST which crushes JSON, XML, YAML, TOML, etc

Hello, my name is Dmitriy Karlovskiy and I.. ride bicycles.. off-road.. against the wind.. uphill.. on skis. And today I invite you to take a ride with me along and across textual data formats and design the ideal format together.

Image description

I already talked about it 5 years ago, which led to heated debates that resulted in minor syntax changes. Therefore, let me tell you from scratch what it is at the moment.

Meta

Speech
    Speaker \Dmitry Karlovsky
    Place \PiterJS #47
    Time 2020-05-20
Enter fullscreen mode Exit fullscreen mode

This is an extended text version of the speech of the same name on PiterJS#47. You can read it as an article or open it in the presentation interface or watch video.

Plan

  • Analyze popular text data formats πŸ’©
  • From scratch, develop a new format without flaws πŸ‘½
  • Show examples of applying the new format πŸ‘Ύ

Formats

We will compare 5 formats.

Format
XML
JSON
YAML
TOML
tree

Only the deaf have not heard about the first three. But the last two are dark horses for many. Well, nothing, today I will shed light on them.

XML example

XML - once the most popular format, you can say "technological standard". But despite all its power, it is now becoming obsolete, as it is too complicated for a modern web developer.

<!DOCTYPE svg
    PUBLIC "-//W3C//DTD SVG 1.1//EN"
    "http://www.w3.org/Graphics/SVG/1.1/DTD/svg11.dtd"
>
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<svg version="1.1" xmlns="http://www.w3.org/2000/svg">
    <circle r="30" cx="50" cy="50" fill="orange" />
</svg>
Enter fullscreen mode Exit fullscreen mode

JSON example

XML is being replaced by a simpler and more daring data format - JSON.

{
    "name": "example",
    "version": "1.0.0",
    "description": "example package",
    "main": "index.js",
    "repository": "https://example.org",
    "author": "anonymous",
    "license": "MIT"
}
Enter fullscreen mode Exit fullscreen mode

If you think that this is the ideal, then I ask you to excuse me in advance, as I will upset you further.

YAML example

Someone is already prophesying YAML to replace JSON.

Date: 2001-11-23 15:03:17-5
User: ed
fatal:
  Unknown variable "bar"
Where:
  file: TopClass.py
  line: 23
  code: |
    x = MoreObject("345\n")
Enter fullscreen mode Exit fullscreen mode

Due to its better human readability, it has already gained popularity in the field of manually writing configuration files.

TOML Example

Few have heard of TOML. However, take a look at the example and it will become clear why I mention it at all.

[servers]

[servers.alpha]
ip="10.0.0.1"
dc="eqdc10"

[servers.beta]
ip="10.0.0.2"
dc="eqdc10"
Enter fullscreen mode Exit fullscreen mode

Yes, it's actually a standardized INI config bitten by JSON. As a result, he absorbed the worst of both worlds.

Example Tree

Finally, as a spoiler, let me show you the minimal non-empty tree file that we will develop next.

spoiler
Enter fullscreen mode Exit fullscreen mode

Data models

Different formats are based on different data models. The chosen model answers the following two questions.

  • What data can we write and read without a tambourine? πŸ₯
  • How to record data that does not fit into the model? πŸ‘ 

No single format is able to support the entire variety of types of subject areas, so the need inevitably arises for packing data into a certain format and then unpacking it back.

XML Model

XML is based on a typed element model that contains one dictionary of attributes and one list of nested typed nodes.

  • NodeList
  • Element Node (<br/>)
  • Attribute Node (tabindex="1")
  • Text Node(Hello, World!)
  • CDATA Node (<![CDATA[ ... ]]>)
  • Processing Instruction Node (<? ... ?>)
  • Comment Node (<!-- ... -->)
  • Document Node
  • Document Type Node (<!DOCTYPE html>)

Disadvantages of the XML Model

This model is quite flexible, but it has a number of limitations: only strings can be attribute values, and there can be only one nested list of nodes. Despite the fact that the XML format is already not the simplest, a banal dictionary with subtrees as values ​​requires additional agreements. For example, this: some elements are used to describe the keys in the parent element and such elements in the parent should be only in one instance.

<panel>
    <head>Are you sure?</head>
    <body>
        <button>Yes</button>
        <button>No</button>
    </body>
</panel>
Enter fullscreen mode Exit fullscreen mode

Here panel is a component, and body is no longer a component, but a parameter. It would have a place in the attributes, but only the strings can be placed in the attributes and nothing more.

XML model extensibility

Thanks to namespaces, many languages ​​can be mixed up within one XML document without breaking the interpretation of each other.

<xsl:stylesheet
    version="1.0"
    xmlns="http://www.w3.org/1999/xhtml"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

    <xsl:template match="/">
        <html>
            <head>
                <link rel="stylesheet" href="web.css" />
            </head>
            <body>
                <xsl:apply-templates select="*" />
            </body>
        </html>
    </xsl:template>

</xsl:stylesheet>
Enter fullscreen mode Exit fullscreen mode

This is a very powerful technique that is lacking in younger formats.

JSON Model

The JSON model is based on the fact that the entire tree consists of untyped lists and dictionaries. Plus a limited set of primitives as tree leaves.

  • Null
  • Boolean
  • Number
  • String
  • Array
  • Dictionary

Disadvantages of the JSON model

It would be naive to believe that two types of structural nodes are enough for everything. For example, let's take a dictionary. The keys in it are not ordered, that is, they can be returned by the parser in any order.

{
    "foo": 777
    "bar": 666
}
Enter fullscreen mode Exit fullscreen mode

What if we need a dictionary with ordered keys?

[
    [ "foo" , 777 ],
    [ "bar" , 666 ]
]
Enter fullscreen mode Exit fullscreen mode

We had to radically change the syntax and stick arrays of arrays. But this is just another type of dictionary.

Non-extensible JSON model

Well, the main drawback of the JSON model is its non-extensibility, which is why you have to introduce a bunch of tricky rules to cram all the variety of application types of their relations. Take, for example, a query to MongoDB, whose authors decided that JSON is a great fit for the role of a query language.

{
    "$or": [
        {
            "sex": "female",
            "age": { "$gt": 16 },
        },
        {
            hobby: {
                "$regex": "\\b(?:java|type)script\\b"
            }
        }
    ]
}
Enter fullscreen mode Exit fullscreen mode

We see that the paired logical operations OR and AND have a completely different syntax. The equality predicate is sorely lacking, because we still need the predicates "greater than", "less than" and even "matches the regular expression". And by the way, regular expressions themselves cannot be represented in JSON except as a string and an agreement that if it is in the dictionary for a key named "$regexp", then this is a serialized regular expression and when parsing, you need to create the corresponding object.

YAML Model

The YAML model is similar in many ways to the JSON model. Unless there is support for time and internal links.

  • !!null
  • !!bool
  • !!int
  • !!float
  • !!str
  • !!timestamp
  • !!seq
  • !!map
  • Anchor & Alias
  • Document
  • TypeTags

YAML model extensibility

The main advantage of YAML is in type annotations, which allow you to explain to the processor which algorithm to use to unpack the data.

--- !!omap
- foo:777
- bar: 666
Enter fullscreen mode Exit fullscreen mode

In this example, we are telling the parser to "take this list of key-value pairs" and convert it to an OrderedMap object (an ordered dictionary).

TOML Model

The TOML model is like JSON, but a little more mundane. For example, integers and real numbers are distinguished here, which is important for compiled languages, and there is also time support.

  • Boolean
  • Integer
  • Float
  • String
  • datetime
  • Array
  • Dictionary

With extensibility, everything is just as bad here as in JSON.

Model Tree

Whatever set of basic types we choose, it will not be enough for everything. This means that some packing and unpacking code will inevitably be required. And it is easiest to work with such code when the number of different types of nodes is minimal, since for each type you need to write a separate branch of logic. At the same time, maximum flexibility is required. Therefore, only two types of nodes will suffice for us.

  • Structure Node
  • Data Node

Structural nodes serve to describe the hierarchy, while data nodes store raw binary data. Any node can store a list of any other nodes, which achieves flexibility unattainable in other formats.

Model extensibility

Total, in terms of extensibility, everything is very bad. Popular formats are either extensible, but incredibly overcomplicated, or simple, but not extensible at all.

XML json YAML TOML tree
Extendability βœ… ❌ βœ… ❌ βœ…
Number of patterns 90 30 210 90 10

Pay attention to YAML. Its grammar has two hundred patterns. It is so complex that you will most likely not find any complete and correct implementation of its parser. Why, even two identically working JSON parsers you still need to search, but there would seem to be 30 patterns in total.

Our goal will be to create an extremely simple, unambiguous, but at the same time maximally extensible format.

Readability

Syntax clarity is important in a variety of scenarios for working with the format: when writing, when reviewing code, when resolving conflicts, when debugging, when studying.

The speed of your work and the predictability of its results directly depends on how the format is serialized. However, some formats have serious problems with this.

XML json YAML TOML tree
Readability ❌ ❌ βœ… βœ… βœ…

XML readability

XML is built around text with tags interspersed with additional information. As long as there is not a lot of this information, everything is fine, but the more it is, the more difficult it is to perceive the text, which eliminates the usefulness of this feature.

Hello Alice!
How are you?
Could you bring me coffee now?

<message>
    <greeting>
        Hi <a href="http://example.org/user/alice">Alice</a>!
    </greeting>
    <body>
        <s>How are you?</s><br/>
        Could you bring me
        <time datetime="1979-10-14T12:00:00.001-04:00">now</time>
        coffee?
    </body>
</message>
Enter fullscreen mode Exit fullscreen mode

JSON readability

XML at least supports multiline text, but JSON, for example, can no longer boast of this. Formats of this type come from an information structure, in which text and not only text values ​​are already interspersed.

{ "greetings": "Hi Alice!\nHow are you?\nCould you bring me some coffee?\n" }
Enter fullscreen mode Exit fullscreen mode

Severity

As a rule, there are no problems with understanding what is written. But YAML excelled here.

XML json YAML TOML tree
Unambiguous syntax βœ… βœ… ❌ βœ… βœ…

YAML lax

a: true # boolean
b: tru # string
c: :-) # string
d: (-: # error
Enter fullscreen mode Exit fullscreen mode

There are quite a few such jokes in YAML.

Escaping

A topic close to readability is escaping. The presence of this in one way or another inevitably leads to a decrease in readability. When designing escaping, the following points should be kept in mind.

  • It is necessary to distinguish format constructs from actual data 😡
  • It is advisable not to lose data in visibility πŸ€“
  • It is advisable not to overcomplicate editing 🀬

Escaping in XML

XML is a wonderful example of how not to do escaping.

foo > 0 && foo < 10
Enter fullscreen mode Exit fullscreen mode

From a simple and visual text, some kind of cryptotext is obtained, which has to be mentally interpreted in order to understand what is written here.

<code>foo &gt; 0 &amp;&amp; foo &lt; 10</code>
Enter fullscreen mode Exit fullscreen mode

Escaping in JSON

There is a similar problem with JSON, albeit to a lesser extent. If you have ever written plugins for VSCode syntax highlighting, then you know that grammars are described there in JSON format, where regular expressions are written.

/"[\s\S]*"/
Enter fullscreen mode Exit fullscreen mode

The regulars themselves are not the most visual things, but escaped ones are even worse. It is very easy to make a mistake in them in such conditions, and it is not very easy to debug them.

"\"[\\s\\S]*\""
Enter fullscreen mode Exit fullscreen mode

Escaping in YAML

In YAML, the escaping problem is generally solved, but at what cost.

  • 5 types of strings 😣
  • 4 whitespace handling modifiers πŸ˜₯

And all this you need to know in order to correctly read any YAML file.

Escaping in Tree

No πŸ€ͺ
Enter fullscreen mode Exit fullscreen mode

The most readable escaping is no escaping. Therefore, we will not have it. You might think that I'm crazy, but a little later I'll show you how to achieve this.

Minification

Many formats support different ways of formatting the same data. But it's always a trade-off between size and readability.

  • Readable formatting weighs a lot 🐘
  • Compact formatting is hard to read πŸ’€

XML minification

<users>
    <user>
        <name>Alice</name>
        <age>20</age>
    </user>
</users>
Enter fullscreen mode Exit fullscreen mode

If you minify XML, you can save several tens of percent in size, but the result is even more difficult to read.

<!-- 13% less -->
<users><user><name>Alice</name><age>20</age></user></users>
Enter fullscreen mode Exit fullscreen mode

JSON minification

{
    "users": [
        {
            "name": "Alice",
            age: 20
        }
    ]
}
Enter fullscreen mode Exit fullscreen mode

With JSON, the savings are slightly greater, but readability suffers more - instead of closing tags, we see a string of square and curly brackets.

// 30% less
{"users":[{"name":"Alice","age":20}]}
Enter fullscreen mode Exit fullscreen mode

Tree minification

No 😲
Enter fullscreen mode Exit fullscreen mode

Our path is uncompromising - the format must be both extremely compact and easily perceived by a person.

Statistics on minification

XML json YAML TOML tree
Readable 195% 140% 125% 110% 100%
Minified 170% 101% - - -

Download sample files.

As you can see, it is possible to make a format that in a readable form weighs less than any other, even if they are minified. The whole secret is that readability is achieved by the structure of the format itself, and does not require additional formatting that bloats the volume.

Holy Wars

A common problem when working with different formats is endless arguments about seemingly trifles.

  • Tabs or spaces? πŸ€Όβ€β™‚οΈ
  • 2 or 4 spaces? πŸ€Όβ€β™€οΈ
  • Do you need a carriage return? ⚑
  • Do we do alignment? 🀺
  • linter/format rules? πŸ”₯
  • when saving/committing/pushing? 🚧

These arguments take time and emotions, but they are completely meaningless. It is better if the format has uniform, clearly defined rules that are equally understood by any tool and person. Therefore, our format will be extremely rigid, without any liberties.

Processing speed

Simplicity, rigidity and lack of escaping potentially gives a much higher possible processing speed.

For example, in JSON, to write an arbitrary string, you need to go through each character and output a backslash to the output buffer before certain ones. That is, we cannot even know in advance how much memory we can allocate for the output buffer. And during parsing, you need to do the reverse operation with the formation of a new line. We cannot reuse the original piece of memory.

serialization: foo\bar => "foo\\bar"

parsing: "foo\\bar" => foo\bar
Enter fullscreen mode Exit fullscreen mode

When we do not have escaping, we can simply take chunks of memory and send them to the output stream during serialization, which is very fast. Conversely, when parsing, we can simply refer to pieces of the original buffer and not make extra memory allocations.

In my knee - length benchmark in the D language, the following results were obtained:

Tree: 299 ms
JSON: 421 ms
Enter fullscreen mode Exit fullscreen mode

For comparison, I used the naive implementation of the tree parser and json parser from the standard library.

Error coordinates

During parsing, information about the original location of the nodes obtained from the format is often lost. For example, we received JSON, started processing it, and somewhere in the depths we suddenly realized that in the database we do not have the user specified in the file. At this moment, we must show an error, but in the text of this error, we cannot indicate in which place of which file it was made. This is because this information is lost during parsing. And this is a very common problem.

XML json YAML TOML tree
Address βœ… ❌ ❌ ❌ βœ…
Position ❌ ❌ ❌ ❌ βœ…
Range ❌ ❌ ❌ ❌ βœ…

In XML nodes there is a link to the resource from which it was obtained, but where it is in this resource - look with your eyes. To solve this problem, there are special parsers that give the output not arrays and dictionaries, but an Abstract Syntax Tree. But working with him is no longer so easy, and even slowly this business.

Well, this information is important, and I suggest not to lose it. Never lose. Saving node coordinates will still come in handy when it comes to AST and sourcemaps.

Stream processing

It happens that there is a lot of data and little memory, but you need to work with data quickly. And it happens that the data does not end at all. For example, you need to continuously process logs as they come in. In these cases, the ability to stream data processing saves.

XML json YAML TOML tree
Streaming ❌ ❌ βœ… βœ… βœ…

As you can see, the most common formats do not have streaming support. They require you to have exactly one complete document root, otherwise it's a parsing error. In the case of constantly arriving data such as logs, for example, adding them to a document while maintaining its correctness is not an easy task.

This does not mean that stream processing cannot be fastened to them. For example, for XML, there are lower-level SAX parsers that allow you to work not with a tree of elements, but with a stream of tags: such and such a tag opened, a string arrived, such and such a tag closed. And for JSON, there is a whole bunch of message streaming protocols. The main problem here is that not every format-supporting tool will be able to digest your data without additional gestures.

Formats that support stream processing can be easily supplemented by appending data to the end. You can glue several data streams into one and, conversely, cut into pieces. Can be processed in parts without waiting for the transfer to complete. And all this without losing the correctness of working with the format.

Tree format

Well, summarizing what was said earlier, let's formulate all the requirements for our new format.

  • Easy syntax ✌
  • No escaping 🀘
  • No liberties πŸ€™
  • No minification πŸ‘
  • Minimum size πŸ‘
  • Guaranteed readability πŸ––
  • Stream processing πŸ’ͺ
  • Exact coordinates of nodes ☝

Just a tree node

So, we need to create a node named "house". What is the minimum code for this?

house
Enter fullscreen mode Exit fullscreen mode

We just write this name and that's it.

List of tree nodes

And if we need not one node, but a whole list?

house
roof
wall
door
window
floor
Enter fullscreen mode Exit fullscreen mode

We just write them on separate lines.

Nesting tree nodes

But what if we want to add hierarchies and put the list of nodes inside the first one?

house
    roof
    wall
    door
    window
    floor
Enter fullscreen mode Exit fullscreen mode

We just write nested nodes with a tab as an indent. Those familiar with the Python language may notice a similar approach here - using good code formatting style as the basis of the syntax, rather than an optional feature.

Deep tree hierarchy

By continuing to add padding, we can create hierarchies of any nesting.

house
    roof
    wall
        door
        window
            glass
    floor
Enter fullscreen mode Exit fullscreen mode

Alone at home

Often there are situations when there is only one nested node, and then it will be somehow wasteful to increase the indentation level for all nested nodes because of it.

street
    house
        wall
            door
            window
Enter fullscreen mode Exit fullscreen mode

Therefore, we simply line up such nodes in one line, separating them with spaces.

street house wall
    window
    door
Enter fullscreen mode Exit fullscreen mode

Indented nodes are already nested in the last node on the previous line.

Raw data

When we need to write arbitrary data, the characters in which should not be processed in any special way, we simply write them after the backslash without any escaping.

\Any data \(^_^)/
Enter fullscreen mode Exit fullscreen mode

The backslash is chosen to be associated with escaping. It sort of escapes the entire text to the end of the line. But, to be precise, it is rather not escaping, but a kind of quotation marks. The backslash is the opening mark, and the newline character is the trailing mark.

Multiline data

But how to write all the same multi-line text containing, among other things, newlines? It's simple: we take a data node and put a list of other data nodes in it.

\
    \Here πŸ±β€πŸ’»
    \   many πŸ±β€πŸ‘“
    \       cats πŸ±β€πŸ‘€
Enter fullscreen mode Exit fullscreen mode

When requesting the string content of the root data node, all nested data nodes will be concatenated via a newline character.

Different types of nodes

Finally, we can use both types of nodes mixed in any combination. For example, let's describe some user.

user
    name \Jin
    age \35
    hobby
        \kendo πŸ±β€πŸ‘€
        \dance πŸ•ΊπŸ½
        \roleplay 🎭
            default
Enter fullscreen mode Exit fullscreen mode

As you can see, everything is quite simple. To create the most advanced data format, we needed only 2 types of nodes and 4 special characters.

Languages ​​based on formats

So far, we have only talked about formats, that is, about serialization methods. On their basis, languages ​​are already being designed that add semantics to abstract format nodes.

Format Languages
XML XHTML, SVG, XSLT, ...
json JSON Schema, json:api, ...
YAML yaml.org/type
TOML -
tree xml.tree, json.tree, view.tree, ...

Any language is some subset of the format data model with restrictions on the possible types of nodes, their relative position and content.

Next, I will show some examples of such languages ​​for the tree format.

Language grammar.tree

Language grammar.tree - designed to describe formal grammars. For example, let's write a complete formal grammar for the tree format itself.

tree .is .optional .list_of line

line .is .sequence
    .optional indent
    .optional nodes
    new_line

nodes .is .sequence
    .optional .list_of struct
    .optional data
    .with_delimiter space

struct .is .list_of .byte
    .except special
Enter fullscreen mode Exit fullscreen mode
data .is .sequence
    data_prefix
    .optional .list_of .byte
        .except new_line

special .is .any_of
    new_line
    data_prefix
    indent
    space

new_line .is .byte \0A
indent .is .list_of .byte \09
data_prefix .is .byte \5C
space .is .list_of .byte \20
Enter fullscreen mode Exit fullscreen mode

As you can see, the grammar of the format is really extremely simple, which allows you to write a parser in any language in just an hour without even resorting to parser generators.

This grammar can be read literally: tree is an optional list of lines, and a line is a sequence of an optional indentation, an optional list of nodes, and a mandatory newline character. Well, and so on.

Language grammar.tree vs EBNF

Comparing grammar.tree with Extended Backus Naur Form you can see that the former is somewhat verbose but clear and concise, while the latter is compact , but for understanding it requires preliminary preparation, the expressive possibilities are still somewhat inferior, and its focus on a single-line representation looks somewhat awkward when using multi-line writing.

tree .is .optional .list_of line

line .is .sequence
    .optional indent
    .optional nodes
    new_line

nodes .is .sequence
    .optional .list_of struct
    .optional data
    .with_delimiter space
Enter fullscreen mode Exit fullscreen mode
tree = {line};

line=[indent],
    [ nodes ],
    new_line;

nodes = data |
    structure,
    { space , struct },
    [ space , data ];
Enter fullscreen mode Exit fullscreen mode

Language xml.tree vs XML

The xml.tree language is a way to represent an XML data model in tree format. Any kind of XML can be generated from it. Conversely, any XML can be converted to xml.tree.

! doctype html
html
    meta @ charset \utf-8
    link
        @ href \web.css
        @ rel \stylesheet
    script @ src \web.js
    body
        h1 \Procter & Gamble
Enter fullscreen mode Exit fullscreen mode
<!doctype html>
<html>

    <meta charset="utf-8" />
    <link href="web.css" rel="stylesheet" />
    <script src="web.js"></script>

    <body>
        <h1>Procter & Gamble</div>
    </body>

</html>
Enter fullscreen mode Exit fullscreen mode

It would be nice to have such integration in the IDE that when opening any XML, you can see and edit its xml.tree representation, but everything would be saved back to XML. This would eliminate the need to break your eyes over ampersands and make working with XML as easy and simple as, for example, with markdown.

Language json.tree vs JSON

And json.tree is a language for describing the json model.

* user *
    name \Jin
    age 35
    hobby /
        \kendo πŸ±β€πŸ‘€
        \dance πŸ•ΊπŸ½
    home \C:\users\jin\
Enter fullscreen mode Exit fullscreen mode
{
    "user": {
        "name": "Jin",
        age: 35
        "hobby": [
            "kendo πŸ±β€πŸ‘€",
            "dance πŸ•ΊπŸ½",
        ],
        "home": "C:\\users\\jin\\"
    }
}
Enter fullscreen mode Exit fullscreen mode

We needed only 2 special characters - an asterisk to denote dictionaries and a slash to denote arrays.

json.tree extensions

The beauty of languages ​​based on formats like XML and Tree is that they are easy to extend while staying within the format. For example, both json and tree as formats fundamentally do not support comments. But, for example, comments are necessary in configs. How to be?

*
    # \If disabled will be used platform specific delimiters
    # \CRLN on windows and LN on others
    unix_delimiters true
Enter fullscreen mode Exit fullscreen mode

In tree, we easily extended the language to suit our needs by adding a special node type for comments.

{
    "unix_delimiters#1": "If disabled will be used platform specific delimiters",
    "unix_delimiters#2": "CRLN on windows and LN on others",
    "unix_delimiters": true,
}
Enter fullscreen mode Exit fullscreen mode

In JSON, the limitations of the model are affected, because of which you have to write crutches.

Language view.tree vs TypeScript

Language view.tree - used for component composition in framework $mol developed by me.

$my_details $mol_view
    sub /
        <= Pager $mol_paginator
            value?val <=> page?val 0
Enter fullscreen mode Exit fullscreen mode

This describes a component that owns another component and their properties are bidirectionally related to each other. You may notice that inside view.tree the json.tree language is also used to describe arrays, dictionaries, numbers and other JSON types.

From such a simple and concise code, a rather sprawling TypeScript class is generated. You can write it with your hands, but it's a chore and without a hierarchy it's not very clear.

class $my_details extends $mol_view {

    sub() { return [ this.Pager() ] }

    @ $mol_mem Pager() {
        const Pager = new $mol_paginator
        Pager.value = val => this.page( val )
        return pager
    }

    @ $mol_mem page( val = 0 ) {
        return value
    }

}
Enter fullscreen mode Exit fullscreen mode

API

Finally, there are various APIs for interacting with the format from different programming languages.

Format Languages ​​ API
XML XHTML, SVG, XSLT, ... DOM, SAX, AST
json JSON Schema, json:api, ... Native, AST
YAML yaml.org/type Native, AST
TOML - Native, AST
tree xml.tree, json.tree, ... AST

For XML, for example, there is a fairly flexible DOM, and there is a low-level SAX. The formats that replaced it mainly return dictionaries, arrays, and so on native to the language. True, the JSON data model is not well represented in compiled languages, where integers and floats are completely different types. And of course, for all languages ​​there is a representation in the form of an Abstract Syntax Tree. True, it is usually slow and inconvenient. We will make it fast and convenient, which will allow us not to fence the zoo of incompatible APIs.

JSON AST

Let's take a simple JSON file and put it in ASTExplorer.

{
  "user": {}
}
Enter fullscreen mode Exit fullscreen mode
{
    "type" : "object",
    "children" : [
        {
            "type" : "Property",
            "key" : {
                "type": "Identifier",
                "value": "user"
            }
            "value": {
                "type": "object",
                "children": []
            }
        }
    ]
}
Enter fullscreen mode Exit fullscreen mode

As you can see, the AST turned out to be large and complex. JSON is generally very poorly suited to describe AST. It is not very easy to work with it without special utilities.

AST Tree

Now let's take a slightly more complex tree file.

user
    name \Jin
    age 35
    hobby
        \kendo πŸ±β€πŸ‘€
        \dance πŸ•ΊπŸ½
        \roleplay 🎭
Enter fullscreen mode Exit fullscreen mode

And look at his AST.

user
    name \Jin
    age 35
    hobby
        \kendo πŸ±β€πŸ‘€
        \dance πŸ•ΊπŸ½
        \roleplay 🎭
Enter fullscreen mode Exit fullscreen mode

So, something is wrong. It's the same code. Ah, no, that's right, tree is its own AST.

Tree node properties

In the TypeScript implementation, each node has roughly the following interface.

interface $mol_tree2 {
    type: string
    value: string
    kids: $mol_tree2[]
    span: $mol_span
}
Enter fullscreen mode Exit fullscreen mode

Span is a reference to a series of bytes in the original resource.

interface $mol_span {
    uri: string
    row: number
    col: number
    length: number
}
Enter fullscreen mode Exit fullscreen mode

Derived Tree nodes

Each node has methods for creating new nodes based on it. These factories, when creating new nodes, push the span from the original node into them. This allows even after dozens of transformations to understand how it all began.

interface $mol_tree2 {
    struct: ( type , kids )=> $mol_tree2
    data: ( value , kids )=> $mol_tree2
    list: ( kids )=> $mol_tree2
    clone: ( kids )=> $mol_tree2
}
Enter fullscreen mode Exit fullscreen mode

Error messages in Tree

For example, let's take the config, find the password in it, and if it doesn't work, we'll throw an exception, where it will be written in which place of which file the wrong password is written.

const config_path = './config.tree'
const config_text = fs.readFileSync( config_path )
const config = $mol_tree2.fromString( config_text , config_path )
// server auth
//  login \root
//  password \qwerty

const password = config.select( 'server' , 'auth' , 'password' , '' )

if( !auth( password.text() ) ) {
    // AuthError: Wrong password
    // \default
    // ./config.tree#5:3-11
    throw password.error( 'Wrong password' , AuthError )
}
Enter fullscreen mode Exit fullscreen mode

Processing Tree

Or another example - we decided that "auth" is an unfortunate name and we need to replace it with "credentials". Therefore, we write a simple script for automatic refactoring:

// server credentials
//  login \root
//  password \qwerty
const new_config = config.list(
    config.hack({

        'auth' : ( auth , context )=> [
            auth.struct( 'credentials' , auth.hack( context ) ),
        ] ,

    })
)
fs.writeFileSync( config_path , new_config )
Enter fullscreen mode Exit fullscreen mode

And in this way you can easily refactor any languages ​​based on the tree format without searching for a separate parser for each language and dealing with how it works with AST.

Support by editors

If you are using an editor for which there is no plugin yet, then this is a good opportunity to implement it. This will be easier to do than for any other language.

Language support

Again, I encourage those who are interested to implement support in their favorite language and try to put it to good use.

Results

XML JSON YAML TOML Tree
Size 195% 140% 125% 110% 100%
Number of patterns 90 30 210 90 10
Unambiguous syntax βœ… βœ… ❌ βœ… βœ…
Readability ❌ ❌ βœ… βœ… βœ…
No escaping needed ❌ ❌ ❌ ❌ βœ…
Exact coordinates of nodes ❌ ❌ ❌ ❌ βœ…
Streaming ❌ ❌ βœ… βœ… βœ…
Extensible data model βœ… ❌ βœ… ❌ βœ…
Widespread βœ… βœ… βœ… ❌ ❌

Ideas

And now let's dream up what else interesting things can be done using the tree format.

  • Requests to the DBMS
  • Domain description
  • Logging
  • Communication of console utilities
  • LISP-like language
  • Universal AST

sql.tree - queries to the DBMS

Remember those clumsy MongoDB queries? Let's try to write our SQL:

select
    from $users
    fetch
        @name
        @phone
        @photo *
            @uri
            @width
            @height
    where or
        and
            @sex = female
            @age > 16
        @hobby ~ \\b(?:java|type)script\b
Enter fullscreen mode Exit fullscreen mode

Parsing the query in this form is a breeze, unlike real SQL. Please note that there is a uniform syntax for logical operations and predicates "is equal to", "greater than" and even "matches the regular expression". By the way, the regular expression can also be described in the tree format, which will make it much more supported.

select
    from $users
    fetch *
    where @hobby ~
        word-edge
        or
            \java
            \type
        \script
        word-edge
Enter fullscreen mode Exit fullscreen mode

domain.tree - description of the domain

Since we are talking about databases. This is how I describe the domain model.

hyoo_api_person
    descr \Live service user
    inherit hyoo_api_entity
    field
        id
            descr \Unique human readable identifier
            example \person=jin
            key unique
            type text
            edit author
        avatar
            descr \Links to avatars
            type list hyoo_api_image
            edit author
        mail
            descr \Attached emails
            type set hyoo_api_mail
Enter fullscreen mode Exit fullscreen mode

From such a formal description, a server API, ACL rules, a DBMS scheme and an admin panel are automatically generated to manage the whole thing.

Logs

A common practice is to output single-line messages to the logs. As long as they fit in the width of your terminal - everything is fine, but this is a rather rare situation. Much more often, messages still do not fit and begin to be transferred, turning the flow of messages into a real mess, which is difficult to read with your eyes, and even programmatically process them - pain and suffering.

log.tree - structured logs

But what if the logs are immediately displayed in a two-dimensional form, at the same time easily readable by both machines and humans?

193.34.12.132 - - [2011-10-20T12:46:08+04:00] GET /nin-jin/slides/edit/master/t
ree/readme.md HTTP/1.1 200 4435
193.34.12.132 - - [2011-10-20T12:46:09+04:00] GET /nin-jin/slides/edit/master/t
ree/readme.html HTTP/1.1 404 4435


access
    ip \193.34.12.132
    time \2011-10-20T12:46:08+04:00
    method \GET
    uri \/nin-jin/slides/edit/master/tree/readme.md
    protocol \HTTP/1.1
    response \200
    size \4435
Enter fullscreen mode Exit fullscreen mode

The lower code is clearer. Isn't it?

tree-tools - CLI tree processing utilities

You can write utilities that would allow you to simply and efficiently process such logs. For example, we will read the log, filter by the value of one of the fields, select from the messages only fields that are interesting to us and display them as a sign.

> cat access.log.tree | pick ip time method uri | table

\193.34.12.132 2011-10-20T12:46:08+04:00 GET /index.html
\193.34.12.132 2011-10-20T12:46:10+04:00 GET /index.css
\193.34.12.132 2011-10-20T12:46:20+04:00 GET /index.js

> cat access.log.tree | filter time >= 2019-09 | pick ip uri | table

\193.34.12.132 /index.html
\193.34.12.132 /index.css
\193.34.12.132 /index.js
Enter fullscreen mode Exit fullscreen mode

I have a prototype of such utility which I sometimes use to view live dev server logs. It will be great if someone undertakes to implement a complete set of tools. And when there are tools, then software developers will be motivated to write logs not randomly, but in a structured way.

tree as a communication protocol

You can go further and not just write logs in tree format, but in principle promote the idea that the output of any program should be structured. Many utilities have flags for outputting a response in the form of JSON or XML, but reading such an output is stressful for a person - you have to reopen the output in visual representation tools in order to understand what is returned there and how to approach it. Just imagine a world where the output can be read and immediately somehow transformed without picking mana in search of the desired combination of keys for the next program.

> gitlog

commit
    message \$mol_style: TS@3.9 compatibility
    sha \b1a8f07c839604d0d34430a186246f0c1f71e628
    date \2020-05-15T23:24:32+0300
    author \nin-jin <sairi-na-tenshi@ya.ru>
commit
    message \$mol_regexp: concurrent parse ability
    sha \be1abfa50542728dd5c156517ea31f469e7fb4d4
    date \2020-05-15T23:03:30+0300
    author \nin-jin <nin-jin@ya.ru>

> git log | pick date message | table

\2020-05-15T23:24:32+0300 $mol_style: TS@3.9 compatibility
\2020-05-15T23:03:30+0300 $mol_regexp: concurrent parse ability
Enter fullscreen mode Exit fullscreen mode

WAT

WebAssembly is a forward-thinking assembler that gets as close to the machine as possible without sacrificing portability. It has a text representation format based on Lisp s-expressions.

(func $fact (param $x i64) (result i64)
    (if $x (result i64)
      (i64.eqz
        (local.get $x))
      (then
        (i64.const 1))
      (else
        (i64.mul
          (local.get $x)
          (call $fact      
            (i64.sub
              (local.get $x)
              (i64.const 1)))))))
Enter fullscreen mode Exit fullscreen mode

It is difficult to perceive it no matter how you format it. Unfortunately, this is the kind of code you will see when disassembling in browser devtools.

wasm.tree - assembler without tinsel

I'm currently working on a bytecode compiler for a more descriptive wasm.tree description.

func
    $fact
    param $x i64
    result i64
    body switch
        test i64.eqz local.get $x
        then i64.const 1
        else i64.mul
            local.get $x
            call $fact i64.sub
                local.get $x
                64.const 1
Enter fullscreen mode Exit fullscreen mode

From this assembler, a list of bytecodes in the [bin.tree] language (https://github.com/nin-jin/tree.d/wiki/bin.tree) is generated, which is already distilled into a binary by an elementary function.

00
61
73
6d
01
00
00
00
.
.
.
Enter fullscreen mode Exit fullscreen mode

When there is something more or less complete, I will try to push this syntax as WAT2.0. Who cares about the fate of WebAssembly - join the development.

jack.tree - LISP without brackets

In fact, writing in raw assembler is too verbose. Therefore, the next step is the implementation of a meta-language that allows you to extend the language by means of the same language itself. The core of such a language should turn out to be extremely minimalistic, and all idioms will be connected to it as third-party libraries written in the same language.

jack
    import wasm
    tree func $fact
        > $x #8
        < #8 switches
            test is-zero $x
            then #8 1
            else mul
                $x
                $fact sub
                    $x
                    #8 1
Enter fullscreen mode Exit fullscreen mode

Roughly speaking, a program in this language iteratively modifies its own AST in such a way that the output is a wasm binary. It may sound intimidating, but thanks to the fact that tree saves the coordinates of the sources, it is not difficult to trace the source of the error. In the repository, you can look at a scanty prototype.

$mol_jack

Abolishing LLVM

You can go even further and generate not wasm bytecodes, but downright bytecodes of the target processor, simply by adding one more transformer to the pipeline.

compile pipelines:

                jack.tree => wasm.tree =============> bin.tree
                jack.tree => wasm.tree => arm.tree => bin.tree
any-dsl.tree => jack.tree => wasm.tree => arm.tree => bin.tree
Enter fullscreen mode Exit fullscreen mode

At the same time, at any level, you can run additional transformers that can optimize the code using the information available at the corresponding levels of abstraction.

optimization middlewares:

jack.tree => jack.tree
wasm.tree => wasm.tree
arm.tree => arm.tree
Enter fullscreen mode Exit fullscreen mode

At the same time, let me remind you that we do not lose touch with the original sources, which will allow us to display adequate messages. And any intermediate AST can always be dumped into text in a very visual form of the tree format.

Again, join the development, it can turn out to be a cool thing to replace LLVM.

One AST to rule them all

And finally, we come to the main idea of ​​this report. Tree is a perfect candidate for a universal AST binder. Just look at how long the TypeScript code goes from source to the resulting bundle when building on a typical project.

code =(P)=> loader =(P)=> compiler =(SP)=> bundler =(SP)=> terser =(S)=> bundle

P - Parse
S - Serialize
Enter fullscreen mode Exit fullscreen mode

And each tool re-parses your sources into its own AST, processes it, serializes it, and passes it on. If we agree on a single AST format, then we can significantly simplify the implementation of utilities and reduce the overhead for code processing.

code =(P)=> loader =====> compiler ======> bundler ======> terser =(S)=> bundle
Enter fullscreen mode Exit fullscreen mode

Even if some of the utilities will run in separate processes (which means intermediate serialization is inevitable), the tree format will allow you to transfer the AST as quickly as possible, due to the minimum overhead for parsing and serialization.

Sandbox

tree.hyoo.ru - a sandbox where you can drive various transformations. Here are some examples:

  • view.tree β‡’ view.ts - translation of the component description into TypeScript code.
  • view.tree β‡’ locale.json - export of reference texts for localization in the form of JSON from the component description.
  • view.tree β‡’ view.dts - export TypeScript types with embedded sorsmaps from component descriptions.
  • JSON β‡’ json.tree - translation of JSON into json.tree.
  • xml.tree β‡’ XML - translation of xml.tree into XML
  • XML β‡’ xml.tree - translation of XML into xml.tree.
  • js.tree β‡’ JS - translation of JavaScript AST into JavaScript proper.
  • wasm.tree β‡’ WASM - compilation of WASM AST into a WASM binary and checking its correctness. This thing is still very raw: only 3 types of sections are supported, you can't run it right there in the sandbox. But as soon as there is time, I will finish the specification.
  • jack.tree β‡’ JS eval is a translation of a meta-language with JavaScript generation with built-in sorsmaps and immediately its execution.
  • MarkedText β‡’ JS - translation of MarkedText into JavaScript code with embedded sorsmaps, which generates a DOM tree using the DOM API.
  • grammar.tree check - grammar correctness check.tree syntax descriptions on the fly.
  • span.tree imprint/reuse - stitching of sources and mapping in span.tree tree, its intermediate serialization into a string, followed by restoration of the original tree without loss of mapping.
  • automate.tree (JS) is an example of writing your own transformation in JavaScript that converts a simple automation script into JavaScript code with built-in sorsmaps.
  • automate.tree (jack) is the same, but using the jack.tree language.

Where to go, where to go

I hope I managed to infect you with ideas about a brighter future. But in order to bring it closer, we need to work on it together. I'm afraid I won't be able to handle all of this. So write, call and do not disappear.

Top comments (0)