Okay, so I've started working on a project that needs to handle a lot of traffic, and by a lot, I mean millions of concurrent connections. So there's a couple of things I wanted to validate: What is the best web server framework? And what is the most efficient DB for handling this kind of traffic? This article covers the latter.
Due to the nature of the data we'll be dealing with, the obvious choice is a NoSQL database. Now, I don't know about you, but my go-to choices for NoSQL data storage tend to be either MongoDB (when I just need a document store) or CouchDB / Cloudant (when I want to do relatively inexpensive MapReduce or data transformations on the fly). I tend to avoid key-value stores like Memcached or Redis, but this is due to interface requirements rather than some ideological conflict.
Also, while I'm well aware that PostgreSQL likes to try and blur the lines, presenting itself as some kind of Swiss Army-like multi-tool, I find the syntax of working with it as a NoSQL database frankly ugly and cumbersome, so we're not even going there.
Recently, however, I was reading about LMDB (the Lightning Memory-Mapped Database), so it got me wondering: Can I build an application-specific DB on top of LMDB that outperforms MongoDB in the real world without choking myself in excessive complexity?
NoSQL databases
I'm going to presume that the majority of people reading this article don't need me to explain what a NoSQL database is, and how NoSQL databases differ from Relational databases (A.K.A. SQL databases).
But for those new to the subject, a NoSQL database is a type of data store that is commonly designed for retrieving and manipulating simple objects that don't have inter-structure relational data. That is to say, objects (or documents) of one data structure have no dependency relationships to objects of another data structure. A relational database, as the name suggests, is instead designed to store data that has one or more relations between the data structures being stored. To this end, relational databases tend to store data in tables, where each object is stored as a row, whereas NoSQL databases tend to store data as either a document or a key-value pair.
This difference is less consequential these days, as the majority of developers prefer to use ORMs (Object-Relational Mappers) to abstract away the underlying database interface, and as a result fewer people need to know the Structured Query Language that NoSQL databases are named for not having. The majority of modern ORMs for NoSQL databases can, in fact, provide relational lookups for those so inclined to use a hammer for tightening bolts. Needless to say, choosing the right tool for the job is a significant part of any job.
The astute amongst the readership will perhaps also be aware of Graph Databases which are a completely different breed altogether. However, these are more of a tangent to the thread than this brief explainer already is. So I'm drawing the line at explaining the difference between SQL databases and NoSQL databases for now.
The rules of engagement
Obviously, there are some significant caveats here since LMDB is a Key-Value store, and MongoDB is a Document Store.
So, to make this less of an apples-to-oranges comparison, let's define a few rules:
- Both solutions will sit behind a web server that will accept and serve JSON Objects that comply with a simple specification
- Both solutions will use the same web server framework that will be responsible for any data format conversions needed
- Both solutions will have access to equal system resources
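One way to picture these rules is a single storage interface that the web layer talks to, with one implementation per database behind it. The sketch below is only an illustration of that idea: `RecordStore`, `InMemoryStore`, `handle_post` and `handle_get` are hypothetical names, the in-memory backend is a stand-in so the code runs without MongoDB or LMDB, and keying records by their CIDR string is an assumption rather than something the tests depended on.

```python
import json
from abc import ABC, abstractmethod


class RecordStore(ABC):
    """What the web layer needs from either database (hypothetical interface)."""

    @abstractmethod
    def put(self, key, record):
        """Persist one record (a dict) under the given key."""

    @abstractmethod
    def get(self, key):
        """Return the record stored under key, or None if it is missing."""


class InMemoryStore(RecordStore):
    """Stand-in backend so the sketch runs without MongoDB or LMDB."""

    def __init__(self):
        self._records = {}

    def put(self, key, record):
        self._records[key] = record

    def get(self, key):
        return self._records.get(key)


def handle_post(store, body):
    """Decode a JSON request body and store it, keyed by its CIDR string."""
    record = json.loads(body)
    store.put(record["cidr"], record)  # assumption: the CIDR is the lookup key
    return record["cidr"]


def handle_get(store, cidr):
    """Fetch a record and encode it as a JSON response body (None if absent)."""
    record = store.get(cidr)
    return None if record is None else json.dumps(record, sort_keys=True)
```

With this shape, swapping databases means swapping the `RecordStore` implementation while the request handlers stay identical, which is the point of the second rule.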
The JSON Objects
An example of the JSON object we're going to POST to the web server will be the following:
{
    "position": {
        "lat": 40.5340,
        "lon": -74.3778
    },
    "radius": 39.66365,
    "cidr": "192.0.2.0/24",
    "country_code": "US",
    "nearest_city": "Edison",
    "timestamp": 1623773290,
    "software_source_version": "0.0.1"
}
An example of the JSON object that we expect to retrieve will be formatted as follows:
{
    "cidr": {
        "bitmask": 24,
        "net_addr": {
            "ipv4": "196.72.0.0",
            "ipv6": "ffff:ffff:ffff:ffff:ffff:ffff:ffff:ffff",
            "version": 4
        }
    },
    "country_code": "US",
    "db_version": {
        "major": 0,
        "minor": 1,
        "patch": 0
    },
    "nearest_city": "Edison",
    "position": {
        "lat": 40.5185,
        "lon": -74.3515
    },
    "radius": 39.66365,
    "software_source_version": {
        "major": 0,
        "minor": 0,
        "patch": 1
    },
    "timestamp": 1623773290
}
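The two shapes differ mainly in how the CIDR and the version strings are expanded. A minimal sketch of that conversion, using only the standard library, might look like the following. The function names are hypothetical, the `db_version` default of "0.1.0" is taken from the example response, and filling the unused address family with an all-ones placeholder is an assumption based on the `ipv6` value shown above.

```python
import ipaddress


def parse_version(text):
    """Split a 'major.minor.patch' string into a dict of integers."""
    major, minor, patch = (int(part) for part in text.split("."))
    return {"major": major, "minor": minor, "patch": patch}


def expand_cidr(text):
    """Turn a CIDR string into the nested shape used in the response."""
    network = ipaddress.ip_network(text, strict=True)
    address = str(network.network_address)
    # Assumption: the address family that is not in use gets an all-ones
    # placeholder, mirroring the ipv6 field of the example response.
    ipv4 = address if network.version == 4 else "255.255.255.255"
    ipv6 = address if network.version == 6 else ":".join(["ffff"] * 8)
    return {
        "bitmask": network.prefixlen,
        "net_addr": {"ipv4": ipv4, "ipv6": ipv6, "version": network.version},
    }


def to_stored_record(posted, db_version="0.1.0"):
    """Build the response-shaped record from a POSTed object."""
    record = dict(posted)
    record["cidr"] = expand_cidr(posted["cidr"])
    record["software_source_version"] = parse_version(
        posted["software_source_version"]
    )
    record["db_version"] = parse_version(db_version)
    return record
```

Passing `strict=True` makes `ip_network` reject a CIDR whose host bits are set, which is one cheap way to validate input before it reaches either database.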
However, as MongoDB stores data in BSON (Binary JSON) format and LMDB simply stores a block of bytes, we're forced to make a decision about what data structure to use. We could use a suitable BSON library to minimise the difference, but this would introduce an extra variable to the equation, as the efficiency of the BSON library used could affect the results. I'd also be unlikely to use BSON in production, as I rarely need that level of flexibility in my data structures. So in this case I chose to use Python's ctypes module to build out a suitable set of data structures, constructing a data model comparable to the one I used for the MongoDB data.
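To give a feel for what a ctypes-based layout can look like, here is a rough sketch of a fixed-size record for the object above. It is not the exact structure used in the tests: the field names, integer widths, the 64-byte city buffer and the choice to store the network address as 16 raw bytes are all assumptions made for illustration.

```python
import ctypes
import ipaddress


class Version(ctypes.Structure):
    """Semantic version as three small unsigned integers."""
    _fields_ = [
        ("major", ctypes.c_uint8),
        ("minor", ctypes.c_uint8),
        ("patch", ctypes.c_uint8),
    ]


class Record(ctypes.Structure):
    """Fixed-size binary layout for one document (illustrative field sizes)."""
    _fields_ = [
        ("lat", ctypes.c_double),
        ("lon", ctypes.c_double),
        ("radius", ctypes.c_double),
        ("net_addr", ctypes.c_uint8 * 16),   # IPv4 is right-aligned in 16 bytes
        ("bitmask", ctypes.c_uint8),
        ("ip_version", ctypes.c_uint8),
        ("country_code", ctypes.c_char * 2),
        ("nearest_city", ctypes.c_char * 64),
        ("timestamp", ctypes.c_uint64),
        ("software_source_version", Version),
        ("db_version", Version),
    ]


def pack_record(doc):
    """Serialise a response-shaped document into the raw bytes to store."""
    net = doc["cidr"]["net_addr"]
    address = ipaddress.ip_address(net["ipv4"] if net["version"] == 4 else net["ipv6"])
    rec = Record()
    rec.lat = doc["position"]["lat"]
    rec.lon = doc["position"]["lon"]
    rec.radius = doc["radius"]
    rec.net_addr = (ctypes.c_uint8 * 16)(*address.packed.rjust(16, b"\x00"))
    rec.bitmask = doc["cidr"]["bitmask"]
    rec.ip_version = net["version"]
    rec.country_code = doc["country_code"].encode("ascii")
    rec.nearest_city = doc["nearest_city"].encode("utf-8")
    rec.timestamp = doc["timestamp"]
    rec.software_source_version = Version(**doc["software_source_version"])
    rec.db_version = Version(**doc["db_version"])
    return bytes(rec)


def unpack_record(raw):
    """Rebuild a Record from bytes previously produced by pack_record."""
    return Record.from_buffer_copy(raw)
```

The trade-off is visible in the field list: every record is the same size and can be read without parsing, but any string longer than its buffer has to be rejected or truncated, which is the flexibility BSON would otherwise provide.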
Software Drag Racing
With all this in hand, it was time to flood both servers with traffic to see which of our NoSQL Databases won. It was a battle for our time; a classic David and Goliath. The Key-Value Store beneath the hood of OpenLDAP vs The world's most popular Document Database. Dynamic Schema vs hardcoded. Who will win?
Machine Specs
For these tests, I used MongoDB v4.4.6, LMDB v0.9.29, Python 3.7.3, and Flask 2.0.1 on an AMD Ryzen 7 3800X with 16GiB of RAM and a Corsair MP510 SSD. As there should be plenty of headroom given the size and simplicity of the requests, I ran everything on the same machine. That means as close to zero network latency as possible, with considerably more bandwidth than we'll likely see in physical networks for a long time.
First Run - 1 Client
The first run was a straight minimum-load test to see what the best-case scenario was. For this I used 1 concurrent client making 1 request every 10 seconds.
As you can see from the graphs, MongoDB can significantly outperform LMDB on writes, pulling ahead with a 5ms advantage. I'm not entirely certain why this is, but I suspect it has something to do with MongoDB not using transactions (due to document atomicity) and LMDB using transactions (to prevent corruption due to reads and writes occurring at the same time on a single memory map).
However, when it comes to reads, LMDB is already showing a subtle lead. This 1ms improvement might seem insignificant when compared to the comparatively massive gap in write performance, but it's important to remember that this is under effectively zero load, so things can change dramatically.
Second Run - Finding the Limits
So let's take a look at what 4000 concurrent clients looks like, starting with the writes again. In this case I focused on random writes; however, due to the nature of both databases, the type of write is largely irrelevant. Here you can see that LMDB is simply smoking MongoDB, pulling a massive lead with a median request time as low as 14ms and a 296ms mean. MongoDB, on the other hand, put in a pitiful performance with a median of 1400ms and a 2522ms mean!
Note: The reason we stopped at approximately 45000 requests during the write tests was that the MongoDB service had been OOM killed when it tried to reserve over 100% of the RAM on the machine. LMDB however didn't use significantly more than the 1GiB that the DB was allocated, and was happily handling the writes at a constant speed (regardless of load).
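If the gap between a 14ms median and a 296ms mean looks odd, it's the usual signature of a long tail: most requests finish quickly while a handful queue up and drag the mean with them. The sketch below shows the effect with made-up numbers. The latencies are purely illustrative and are not taken from my test runs.

```python
import statistics


def summarise(latencies_ms):
    """Return the median and mean of a list of request latencies."""
    return {
        "median_ms": statistics.median(latencies_ms),
        "mean_ms": statistics.mean(latencies_ms),
    }


# Illustrative values only (NOT measurements): seven fast requests and
# three that stalled behind a queue.
sample = [12, 13, 14, 14, 15, 16, 18, 900, 1500, 2400]

summary = summarise(sample)
print(summary)  # the median stays near the fast requests, the mean does not
```

This is why I quote both figures: the median describes the typical request, while the mean hints at how bad the slow ones were.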
The read performance surprised me in a big way. For this test, I chose to focus on random reads (rather than sequential or static reads), because of all the read patterns this is by far the most difficult. When performing random reads I had to cap the concurrent clients at 2000, as MongoDB was hitting the host OS's "Open File Limit", which was leading to general system instability. At 2000 clients there is no contest: LMDB was on average 1000x faster and, unlike MongoDB, did not experience any errors.
To find the upper limit of LMDB on this system I gradually increased the number of concurrent clients until the response time was similar to MongoDB's performance for 2000 clients. When the LMDB test reached around 10000 clients, Locust also seemed to be feeling the strain, so I'm not 100% sure which was the actual source of the slowdown. For the sake of playing devil's advocate, let's assume that it was LMDB (although it could have easily been Locust or indeed Flask for all I know). At this limit, we still weren't seeing any failures and could have probably pushed it further, but as the average response time was close to 1000ms I chose to call it off. That was still a 3x faster response time than MongoDB managed while handling 1/5th of the concurrent users, but I digress.
What does this mean then?
Should everyone make the switch from MongoDB to LMDB? Absolutely not!
Yes, LMDB absolutely obliterates MongoDB in almost every measurable way when dealing with high load. So if that's what you're optimising for then okay, have at it. But beware; going down this path leads to dragons. MongoDB supports indexing, clustering, backups, and a whole menagerie of other features that you're going to need to build yourself if you want to use LMDB with a similar set of tools.
Another caveat to keep in mind is that, because LMDB is memory-mapped, the size of your database is limited by the virtual address space available to the process rather than by the amount of physical RAM in the system (although performance is likely to suffer once the working set no longer fits in memory). This means that on 32-bit systems you're capped at something under 4GiB. On 64-bit systems you have a theoretical upper limit in the multi-exabyte range, although current x86_64 hardware is limited to far less than a true 64-bit address would allow: the architecture defines at most 52 physical address bits (4PiB), and most CPUs implement 48-bit virtual addresses (256TiB). That said, I wish you luck in finding hardware that comes close to supporting that yet.
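Address-space limits like these are just powers of two, so they're easy to sanity-check. A quick sketch, assuming the commonly cited x86_64 widths of 48 bits (virtual addresses with 4-level paging), 57 bits (virtual addresses with 5-level paging) and 52 bits (the architectural maximum for physical addresses); `addressable` is a hypothetical helper name:

```python
UNITS = ("B", "KiB", "MiB", "GiB", "TiB", "PiB", "EiB")


def addressable(bits):
    """Return the size of a 2**bits byte address space as a readable string."""
    size = 2 ** bits
    index = 0
    # Divide by 1024 until the value fits the unit or we run out of units.
    while size >= 1024 and index < len(UNITS) - 1:
        size //= 1024
        index += 1
    return f"{size} {UNITS[index]}"


for width in (32, 48, 52, 57, 64):
    print(f"{width}-bit: {addressable(width)}")
# 32-bit: 4 GiB
# 48-bit: 256 TiB
# 52-bit: 4 PiB
# 57-bit: 128 PiB
# 64-bit: 16 EiB
```

In practice the usable figure is lower still, because the operating system keeps part of the address space for itself.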
Final thoughts
This isn't a drop-in replacement, and it almost certainly never will be. It's an elegant tool for those with the time and patience to build upon it, but it's not remotely beginner-friendly if you're looking for a document store. You need significant experience in designing custom binary formats if you want to achieve the performance differences that I've demonstrated here. Using BSON (or another similar binary format) as a layer on top could lower this barrier to entry, but that is going to come with a performance cost that might leave you closer to MongoDB than to LMDB.
As always, choosing the right tool for the job is the real battle. If you're comfortable dealing with binary formats, then this is definitely an absolute jackhammer to add to your toolkit.
Further reading
You can find Part 1 of my new series on building databases in Python with LMDB over at blog.plaintextnerds.com
There's also an interesting interview on how HarperDB utilises LMDB under the hood right here on Dev.to
Twitch.tv/plaintextnerds
If you want to keep up with what I'm working on, then why not give me a follow over on Twitch?