DEV Community

Cover image for Useful Chompers
Dwayne Crooks
Dwayne Crooks

Posted on • Edited on

Useful Chompers

There are useful chompers, not defined in elm/parser, that can help you in the lexical analysis stage of your parser.

What are chompers?

Chompers or the "chomping" family of functions are special functions, that return Parser (), that allow you to validate a sequence of characters. elm/parser defines two such functions called chompIf and chompWhile but there are many others, for e.g. chompOptional, chompOneOrMore, chompExactly, chompAtLeast, chompAtMost, and chompBetween, that can be useful to define as well.

What is lexical analysis?

A parser is usually separated into two stages. The first stage does the lexical analysis, or token generation, by which the input characters are split into meaningful symbols defined by a grammar of regular expressions.

For e.g. the meaningful symbols in the context-free grammar for spreadsheet formulas are Text, Decimal, Coord, and Identifier.

Formula     ::= '=' Expr | Text
Expr        ::= Number
              | Cell
              | Range
              | Application
Number      ::= Decimal
Cell        ::= Coord
Range       ::= Coord ':' Coord
Application ::= Identifier '(' ( Expr ( ',' Expr )* )? ')'

/* Lexemes */

Text       ::= [^=] .*
Decimal    ::= '-'? [0-9]+ ('.' [0-9]*)?
Coord      ::= [A-Z] ( '0' | [1-9] [0-9]? )
Identifier ::= [a-zA-Z_] [a-zA-Z0-9_]*
Enter fullscreen mode Exit fullscreen mode

Lexical analysis with chompers

You can use chompers to write the equivalent of any regular expression. As a result, chompers become pretty handy for lexical analysis.

For e.g. the lexemes Text, Decimal, Coord, and Identifier are defined by regular expressions in the grammar above. This means we can rewrite them in elm/parser using a combination of chompers. Decimal could be defined as follows:

decimal : Parser Float
decimal =
    chompDecimal
        |> P.mapChompedString (\s _ -> String.toFloat s |> Maybe.withDefault 0)
        |> lexeme


chompDecimal : Parser ()
chompDecimal =
    let
        decimalPart =
            P.succeed ()
                |. P.chompIf ((==) '.')
                |. P.chompWhile Char.isDigit
    in
    P.succeed ()
        |. chompOptional ((==) '-')
        |. chompOneOrMore Char.isDigit
        |. chompZeroOrOne decimalPart
Enter fullscreen mode Exit fullscreen mode

Read the Lexer.elm module to see how the rest of them could be defined.

N.B. Assume elm/parser is installed and the following imports have been made available in the code snippets below.

import Parser as P exposing ((|.), (|=), Parser)
Enter fullscreen mode Exit fullscreen mode

chompOptional

It validates zero or one occurrences of the matching characters. It is equivalent to the question mark ? in regular expressions.

chompOptional : (Char -> Bool) -> Parser ()
chompOptional isGood =
    P.oneOf
        [ P.chompIf isGood
        , P.succeed ()
        ]
Enter fullscreen mode Exit fullscreen mode

For e.g. we can define optionalSign ::= ( '+' | '-' )? as follows:

chompOptionalSign : Parser ()
chompOptionalSign =
    chompOptional (\c -> c == '+' || c == '-')
Enter fullscreen mode Exit fullscreen mode

Check out a full example.

chompOneOrMore

It validates one or more occurrences of the matching characters. It is equivalent to the plus sign + in regular expressions.

chompOneOrMore : (Char -> Bool) -> Parser ()
chompOneOrMore isGood =
    P.succeed ()
        |. P.chompIf isGood
        |. P.chompWhile isGood
Enter fullscreen mode Exit fullscreen mode

For e.g. we can define natural ::= [0-9]+ as follows:

chompNatural : Parser ()
chompNatural =
    chompOneOrMore Char.isDigit
Enter fullscreen mode Exit fullscreen mode

Check out a full example.

chompExactly

chompExactly n is equivalent to {n} in regular expressions. It validates that the matching characters occurred exactly n times.

Here's a straightforward but slow implementation:

chompExactlySlow : Int -> (Char -> Bool) -> Parser ()
chompExactlySlow n isGood =
    if n <= 0 then
        P.succeed ()

    else
        -- if n > 0 then
        P.succeed ()
            |. P.chompIf isGood
            |. chompExactlySlow (n - 1) isGood
Enter fullscreen mode Exit fullscreen mode

And, here's a faster implementation that uses loop:

chompExactlyFast : Int -> (Char -> Bool) -> Parser ()
chompExactlyFast n isGood =
    P.loop n
        (\i ->
            if i <= 0 then
                P.succeed <| P.Done ()

            else
                P.chompIf isGood
                    |> P.andThen (\_ -> P.succeed (P.Loop (i - 1)))
        )
Enter fullscreen mode Exit fullscreen mode

I ran some benchmarks and it turns out that chompExactlyFast is about 70% faster than chompExactlySlow.

The results of a benchmark for chompExactly

As an example, you can use chompExactly to help you parse ZIP+4 formatted zip codes, where zipCode ::= [0-9]{5} '-' [0-9]{4}.

type alias ZipCode =
    { basic : String
    , geoSegment : String
    }

zipCode : Parser ZipCode
zipCode =
    P.succeed ZipCode
        |= nDigits 5
        |. P.chompIf ((==) '-')
        |= nDigits 4

nDigits : Int -> Parser String
nDigits =
    P.getChompedString << chompNDigits

chompNDigits : Int -> Parser ()
chompNDigits n =
    chompExactly n Char.isDigit
Enter fullscreen mode Exit fullscreen mode

Check out a full example.

chompAtLeast

chompAtLeast min is equivalent to {min,} in regular expressions. It validates that the matching characters occurred min or more times.

chompAtLeast : Int -> (Char -> Bool) -> Parser ()
chompAtLeast min isGood =
    P.succeed ()
        |. chompExactly min isGood
        |. P.chompWhile isGood
Enter fullscreen mode Exit fullscreen mode

Check out some examples of chompAtLeast.

chompAtMost

chompAtMost max is equivalent to {, max} in regular expressions. It validates that the matching characters occurred up to max times.

chompAtMost : Int -> (Char -> Bool) -> Parser ()
chompAtMost max isGood =
    P.loop 0
        (\i ->
            if i < max then
                P.oneOf
                    [ P.chompIf isGood
                        |> P.andThen (\_ -> P.succeed (P.Loop (i + 1)))
                    , P.succeed <| P.Done ()
                    ]

            else
                P.succeed <| P.Done ()
        )
Enter fullscreen mode Exit fullscreen mode

Firstly, notice that chompAtMost max always succeeds just like chompWhile. The only difference is that it is bounded by max. While i < max we try to consume the next character, P.chompIf isGood. But, if it fails that's fine since it means we've matched all the characters we could match up to this point, i.e. i of them. That's why we stop with success. In the case, i == max, we've match the most we're allowed to match so again we stop with success.

Check out some examples of chompAtMost.

chompBetween

chompBetween min max is equivalent to {min, max} in regular expressions. It validates that the matching characters occurred at least min times, but not more than max times.

chompBetween : Int -> Int -> (Char -> Bool) -> Parser ()
chompBetween min max isGood =
    if min > max then
        P.succeed ()

    else
        let
            l =
                Basics.max min 0

            h =
                Basics.max l max
        in
        chompExactly l isGood
            |> P.andThen (\_ -> chompAtMost (h - l) isGood)
Enter fullscreen mode Exit fullscreen mode

Given, 0 < min <= max, then to implement chompBetween min max we first try to match exactly min characters. Then, the most characters we're allowed to match after that is max - min.

Check out some examples of chompBetween.

Conclusion

Chompers work at the character level just like regular expressions. They can be used to describe the lexemes or tokens of your grammar. Because of this, I recommend putting your chomping functions in a Lexer.elm module and having a Parser.elm module that depends on it.

You can view an example of my approach in my formula lexer and formula parser (here's the grammar) for the 7GUIs Cells task.

For now I'm just collecting useful chompers in dwayne/elm-chompers and some day I may turn it into an Elm package, but for now I'm still exploring. Feel free to copy and paste these chompers into your own projects if you find them useful.

Top comments (3)

Collapse
 
nano7 profile image
Nano

Thank you for the excellent article.

One question has been bothering me for a long time.

Just to give you contrived example, given string <h1>Title: Something</h1> how do I capture Title: Something between the tags equivalent to regex >(.*?)<.

Collapse
 
dwayne profile image
Dwayne Crooks

For this particular case you can do something like the following:

h1Parser : P.Parser String
h1Parser =
    P.succeed identity
        |. P.token "<h1>"
        |= P.getChompedString (P.chompWhile (\c -> c /= '<'))
        |. P.token "</h1>"
Enter fullscreen mode Exit fullscreen mode

Here's the full example.

Collapse
 
minibill profile image
Leonardo Taglialegne • Edited

The picture 🤩

To be clear, I also liked the article, but I felt that the picture deserved a special mention ^_^