Originally posted on crunchingnumbers.live
In the past 2 weeks, I tasked myself with helping users create and visualize LDAP filters. An LDAP filter allows users to make a search in a powerful manner, and can take any of these forms:
Problem description
We can parse the easy examples above with regex (regular expression). Unfortunately, an LDAP filter can also be a combination of LDAP filters, joined by the and (&
), or (|
), or not (!
) operator.
LDAP filters, which exhibit recursion, are an example of context-free grammar. Regex, which stems from regular grammar (a subclass of context-free), can’t handle recursion on its own. We, as developers, would need to implement it and risk creating bugs.
With Nearley, we can easily define a context-free grammar. In fact, it helps us solve 5 problems altogether.
a. Recognition
The first problem, recognition, is the very purpose of tools like regex and Nearley. Given a string, decide whether it is an LDAP filter.
b. Classification
The second problem, classification, means we know why a string is an LDAP filter.
c. Error Handling
Conversely, we want to tell users why a string is not an LDAP filter.
d. Transformation
Instead of simple true
or false
, the output should have rich, easy-to-digest structure and content. We can then use the output to help users visualize their input.
e. Maintainability
Perhaps most importantly, we must be able to easily understand the grammar. Not just today when we write it, but at any time when we return to it.
Notice how the code is self-documenting. Nearley can even draw railroad diagrams to help us visualize our grammar:
In comparison, here’s regex:
2. Example
We will look at the LDAP filter as a concrete, real-life application of Nearley. The full definition can be found in RFC 2254 and RFC 4515.
For simplicity, we define an LDAP filter to be either a simple or present expression. Let’s march the definition in top-down fashion:
In the context of grammar, the elements in white are called terminals, while those in yellow are called nonterminals. Terminals are like atoms; they are the smallest unit that can’t be further divided (subatomic particles don’t exist in this analogy). Nonterminals are like molecules and compounds; they are built upon the terminals through rules.
3. Code
We will use Ember (3.7.0) to house the grammar. This way, we can write regression tests and confidently make changes to the grammar.
By the end, we will have created and modified 6 files:
Folder structure
.
|-- app
| └-- utils
| └-- ldap-builder.js
|
|-- tests
| └-- unit
| └-- utils
| └-- ldap-builder-test.js
|
|-- vendor
| └-- ldap-filter-grammar.ne
|
|-- compile-ldap-filter-grammar.sh
|
|-- ember-cli-build.js
|
└-- package.json
a. Ember
First, install Nearley and Browserify globally, so that we can compile the grammar file, ldap-filter-grammar.ne.
Terminal: /
npm install -g nearley
npm install -g browserify
Then, we write a script to help with compilation:
File: /compile-ldap-filter-grammar.sh
nearleyc ./vendor/ldap-filter-grammar.ne -o ./vendor/ldap-filter-grammar-temp.js
browserify ./vendor/ldap-filter-grammar-temp.js -o ./vendor/ldap-filter-grammar.js -s ldap-filter-grammar
rm ./vendor/ldap-filter-grammar-temp.js
The compiled file, ldap-filter-grammar.js, must live in the vendor folder. To tell Ember that this file exists, we need to update ember-cli-build.js.
File: /ember-cli-build.js
'use strict';
const EmberApp = require('ember-cli/lib/broccoli/ember-app');
module.exports = function(defaults) {
let app = new EmberApp(defaults, {
// Add options here
});
app.import('vendor/ldap-filter-grammar.js', {
using: [
{ transformation: 'amd', as: 'ldap-filter-grammar' },
],
});
return app.toTree();
};
Finally, we create a utility file that uses our grammar to parse an incoming string. To do so, we also need to install Nearley locally.
Terminal: /
npm install nearley --save-dev
ember install ember-auto-import
ember g util ldap-builder
File: /app/utils/ldap-builder.js
import ldapFilterGrammar from 'ldap-filter-grammar';
import nearley from 'nearley';
export default {
createLdapObject(ldapFilter) {
try {
const parser = new nearley.Parser(nearley.Grammar.fromCompiled(ldapFilterGrammar));
parser.feed(ldapFilter);
const results = parser.results;
// If there is a match, return the first result
if (results.length > 0) {
return results[0];
}
} catch (error) {
// If there is no match, return false
return false;
}
return false;
},
};
We can then run tests against our grammar:
Terminal: /
ember t --server --filter='Unit | Utility | ldap-builder'
b. Nearley
Writing a grammar is easy in Nearley. We copy-paste the code that was shown in the top-down approach. Nearley uses a syntax that is based on BNF (Backus-Naur Form). If the grammar to your application also uses BNF (e.g. LDAP filter), you are in luck!
File: /vendor/ldap-filter-grammar.ne
# Define an LDAP filter
filter ->
"(" expression ")" {%
(data) => data[1]
%}
expression ->
simple {% id %} |
present {% id %}
# Define a simple expression
simple ->
attribute operator value {%
(data) => {
const attribute = data[0];
const operator = data[1];
const value = data[2];
// Handle success
return {
expression: `(${attribute}${operator}${value})`,
metadata: {
expressionType: 'simple',
attribute,
operator,
value,
},
};
}
%}
attribute ->
[\w]:+ {%
(data) => data[0].join('')
%}
operator ->
"=" {% id %} |
"~=" {% id %} |
"<=" {% id %} |
">=" {% id %}
value ->
[\w]:+ {%
(data) => data[0].join('')
%}
# Define a presence expression
present ->
attribute "=*" {%
(data) => {
const attribute = data[0];
const operator = '=';
const value = '*';
// Handle success
return {
expression: `(${attribute}${operator}${value})`,
metadata: {
expressionType: 'present',
attribute,
operator,
value,
},
};
}
%}
I think most of the code is self-explanatory. The {%
and %}
, which follow a rule, mark a postprocessor. A postprocessor transforms the outputs of incoming data (terminals or nonterminals) to define the output of the outgoing nonterminal. The best part? We write postprocessors in JavaScript, so we can use methods like map
, join
, and reduce
to transform arrays.
Speaking of arrays, the default match output of Nearley is an array. For example, if the rule for op
matches because the input had '~='
, Nearley returns the array ['~=']
. While arrays allow Nearley to handle ambiguity in a grammar and present all matches, we may be more interested in the content of an array so that we can pass it to the next rule. To do so, we can use the built-in postprocessor, id
(stands for identity, I believe).
If we want to combine the contents of an array, we can access each content using the array index. Since a postprocessor is just JavaScript, we can even use destructuring:
simple ->
attribute operator value {%
(data) => {
const [ attribute, operator, value ] = data;
[...]
}
%}
present ->
attribute operator value {%
([ attribute, operator, value ]) => {
[...]
}
%}
4. Best Practices
Take it from me, working on LDAP filters. The spec is strict on spaces, but I had liberally allowed them so that the user could enter expressions like (id >= 123)
with spaces in-between. While this had increased usability, I had falsely marked expressions such as cn=Isaac Lee
as not valid. Fixing this cost 2 days. Start right.
Nearley creates an array as default output, but allows you to customize the output using postprocessors. If this is your first time writing context-free grammar, you may be unsure of your output. Write tests to verify it. You can also use existing tests for regression testing, so that you can confidently write additional grammar.
Now that the rules are in place, you can think about relaxing them. This can introduce ambiguity (decrease performance), but increase the user experience. For example, by making attribute
, operator
, and value
optional, I can detect the missing element and alert the user.
simple ->
attribute:? operator:? value:? {%
(data) => {
[...]
}
%}
First, check that existing tests (for the correct rules) are still passing. Afterwards, write additional tests for the relaxed rules.
My public repo features 5 tests. In case the small number discourages you from writing tests for your application, I want to mention that the full version has over 220 tests. It’s why I feel confident in my LDAP solution.
Lastly, take the time to refactor your code. This may mean renaming variables to more descriptive ones, creating helper functions to remove repeated code, and updating rules to reduce or eliminate ambiguity. By doing so, your grammar will become more maintainable.
Notes
For more information on Nearley, I encourage you to visit these sites:
- Official Documentation
- Parsing Absolutely Anything in JavaScript Using Earley Algorithm — by Gajus Kuizinas
- Nearley Parser Playground
To learn the theory behind regular and context-free grammars, you can check out James Hein’s “Discrete Structures, Logic, and Computability.”
Lastly, you can read on LDAP filters at LDAP.com (more readable than the RFCs linked above).
You can find the code in its entirety here:
Top comments (2)
This seems super useful
Thanks, Ben. Yeah, definitely. Learning how to use Nearley was also a lot of fun.