I'm working on an OSS project called AppMap for VS Code which records execution traces of test cases and running programs. It emits JSON files, which can then be used to automatically create dependency maps, execution trace diagrams, and other interactive diagrams which are invaluable for navigating large code bases. Here's an example using Solidus, an open source eCommerce app with over 23,000 commits!
Each AppMap file can range from several kilobytes up to 10MB. AppMap has been used with projects up to 1 million lines of code, and over 5,000 test cases (each test case produces an AppMap). You can imagine, a lot of JSON is generated! I’m working on a new feature that uses AppMaps to compare the architecture of two different versions of an app, so I need to efficiently process a lot of JSON as quickly as possible.
In this article I’m going to present a few of the obstacles I encountered while processing all this JSON using Node.js, and how I resolved them.
Getting Asynchronous
Let’s start with the basics. The built-in asynchronous nature of JavaScript means that our programs can do useful work with the CPU while simultaneously performing I/O. In other words, while the computer is communicating with the network or filesystem (an operation which doesn't keep the CPU busy), the CPU can be cranking away on parsing JSON, animating cat GIFs, or whatever.
To do this in JavaScript, we don't really need to do anything special, we just need to decide how we want to do it. Back in the day, there was only one choice: callback functions. This approach was computationally efficient, but by default the code quickly became unreadable. JavaScript developers had a name for this: “callback hell”. These days, the programming model has been simplified with Promises, async and await. In addition, the built-in fs module has been enhanced with a Promises-based equivalent, fs/promises. So, my code uses fs/promises with async and await, and it reads pretty well.
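As a quick illustration (a sketch, not from the project; the file name is just a placeholder), here's the same read written both ways:

const fs = require('fs');
const fsp = require('fs').promises;

// Callback style: each dependent step nests one level deeper.
fs.readFile('example.appmap.json', 'utf8', (err, data) => {
  if (err) return console.error(err);
  const appmap = JSON.parse(data);
  console.log(appmap.version);
});

// Promises with async/await: reads top to bottom.
async function readExample() {
  const data = await fsp.readFile('example.appmap.json', 'utf8');
  const appmap = JSON.parse(data);
  console.log(appmap.version);
}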
loadAppMaps
const fsp = require('fs').promises;
const { join } = require('path');

// Recursively load appmap.json files in a directory, invoking
// a callback function for each one. This function does not return
// until all the files have been read. That way, the client code
// knows when it's safe to proceed.
async function loadAppMaps(directory, fn) {
  const files = await fsp.readdir(directory);
  await Promise.all(
    files
      .filter((file) => file !== '.' && file !== '..')
      .map(async function (file) {
        const filePath = join(directory, file);
        const stat = await fsp.stat(filePath);
        if (stat.isDirectory()) {
          await loadAppMaps(filePath, fn);
        }
        if (file.endsWith('.appmap.json')) {
          const appmap = JSON.parse(await fsp.readFile(filePath));
          fn(appmap);
        }
      })
  );
}
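For example, a caller might use it like this (hypothetical usage; the directory name and the events tally are just for illustration):

(async function () {
  // Count all recorded events across every AppMap in a directory tree.
  let eventCount = 0;
  await loadAppMaps('tmp/appmap', (appmap) => {
    eventCount += appmap.events.length;
  });
  console.log(`Total events: ${eventCount}`);
})();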
Bonus material: A note about Promise.all and Array.map
An async function always returns a Promise, even if nothing asynchronous actually happens inside of it. Therefore, anArray.map(async function() {}) returns an Array of Promises. So, await Promise.all(anArray.map(async function() {})) will wait for all the items in anArray to be processed. Don't try this with forEach! Here's a Dev.to article all about it.
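Here's a quick illustrative sketch (not from the project) of the difference:

// Broken: forEach discards the Promises returned by the async callback,
// so 'done' prints before any file has actually been read.
async function broken(files) {
  files.forEach(async (file) => {
    await fsp.readFile(file);
  });
  console.log('done');
}

// Works: map collects the Promises, and Promise.all waits for all of them.
async function works(files) {
  await Promise.all(files.map((file) => fsp.readFile(file)));
  console.log('done');
}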
Asynchronous processing is so ubiquitous in JavaScript that it would be easy to think there’s no downside. But consider what happens in my program when there are thousands of large AppMap files. In a synchronous world, each file would be processed one by one. It would be slow, but the maximum memory required by the program would simply be proportional to the largest JSON file. Not so in JavaScript! My code permits, even encourages, JavaScript to load all of those files into memory at the same time. No bueno.
What to do? Well, I had to do some actual work in order to manage memory utilization. Disappointing, in 2021, but necessary. (Kidding!)
Keeping a lid on things, with Async
When I was writing an LDAP server in Node.js back in 2014 (true story), there was this neat little library called Async. This was before the JavaScript Array class had helpful methods like map, reduce, and every, so Async featured prominently in my LDAP server. Async may not be as essential now as it used to be, but it has a very useful method mapLimit(collection, limit, callback). mapLimit is like Array.map, but it runs a maximum of limit async operations at a time.
To introduce mapLimit, most of loadAppMaps was moved into listAppMapFiles. loadAppMaps became:
const asyncUtils = require('async');

async function loadAppMaps(directory) {
  const appMapFiles = [];
  await listAppMapFiles(directory, (file) => {
    appMapFiles.push(file);
  });
  return asyncUtils.mapLimit(
    appMapFiles,
    5,
    async function (filePath) {
      return JSON.parse(await fsp.readFile(filePath));
    }
  );
}
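listAppMapFiles isn't shown here, but it is essentially the directory walk from the earlier version of loadAppMaps; a sketch might look like this:

// Recursively find .appmap.json files, invoking the callback with each path.
async function listAppMapFiles(directory, fn) {
  const files = await fsp.readdir(directory);
  await Promise.all(
    files.map(async (file) => {
      const filePath = join(directory, file);
      const stat = await fsp.stat(filePath);
      if (stat.isDirectory()) {
        await listAppMapFiles(filePath, fn);
      } else if (file.endsWith('.appmap.json')) {
        fn(filePath);
      }
    })
  );
}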
Loading 5 files concurrently seems like enough to get the benefits of async processing, without having to worry about running out of memory. Especially after the next optimization...
Parsing just what's needed, with Oboe.js
I mentioned that I'm computing the "diff" between two large directories of AppMaps. As it happens, I don't always need to read everything that's in an AppMap JSON file; sometimes, I only need the "metadata".
Each AppMap looks like this:
{
  "version": "1.0",
  "metadata": { ... a few kb ... },
  "class_map": { ... a MB or so ... },
  "events": [ potentially a huge number of things ]
}
Almost all of the data is stored under the events key, but we only need the metadata. Enter Oboe.js, a streaming JSON parser.
Streaming means, in this case, "a bit at a time".
The Oboe.js API has two features that were useful to me:
- You can register to be notified on just the JSON object keys that you want.
- You can terminate the parsing early once you have what you need.
The first feature makes the programming model pretty simple, and the second feature saves program execution time. The streaming nature of it ensures that it will use much less memory than JSON.parse, because Oboe.js will not actually load the entire JSON object into memory (unless you force it to).
My use of Oboe looks something like this:
const { createReadStream } = require('fs');
const oboe = require('oboe');

function streamingLoad(fileName, metadata) {
  return new Promise(function (resolve, reject) {
    oboe(createReadStream(fileName))
      .on('node', 'metadata', function (node) {
        metadata[fileName] = node;
        // We're done!
        this.abort();
        resolve();
      })
      .fail(reject);
  });
}
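Putting the pieces together, here's a hedged sketch of how streamingLoad might be combined with mapLimit (same 5-at-a-time limit as above; not necessarily the project's exact code):

async function loadMetadata(appMapFiles) {
  // Maps each file name to its metadata object, streaming only what's needed.
  const metadata = {};
  await asyncUtils.mapLimit(appMapFiles, 5, async (fileName) =>
    streamingLoad(fileName, metadata)
  );
  return metadata;
}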
Wrap-up
So, that's the story. To recap:
- fs/promises gives you a nice modern interface to Node.js fs.
- Async.mapLimit prevents too much data from being loaded into memory at the same time.
- Oboe is a streaming JSON parser, so we never have the whole document loaded into memory.
I haven't optimized this for speed yet. My main concern was making sure that I didn't run out of memory. When I profile this, if I find any useful performance speedups, I will write those up too. You can follow me on this site to be notified of future articles!
While you're here...
State of Architecture Quality Survey
My startup AppLand is conducting a survey about software architecture quality. To participate in the survey, visit the State of Software Architecture Quality Survey. Thanks!