This article is my excerpt of Martin Kleppmann's "Designing Data-Intensive Applications" book. These data structures are explained in chapter 3, "Storage and Retrieval", pages 69 through 79. It's a really great read, I would recommend the whole book!
Everything starts with a very dumb key-value database implemented as just two Bash functions:
db_set () {
echo "$1,$2" >> database
}
db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
The idea is to store the data in a CSV-like file:
$ source database.sh
$ db_set 1 'Anakin Skywalker'
$ db_set 2 'Luke Skywalker'
$ db_set 1 'Darth Vader'
$ cat database
1,Anakin Skywalker
2,Luke Skywalker
1,Darth Vader
$ db_get 1
Darth Vader
Note that the first value for the key 1
is overridden by the subsequent write.
This database has pretty good write performance: db_set
just appends the data to a file, which is generally fast. But reads are inefficient, especially on huge data sets: db_get
scans the entire file. Thus, writes are O(1) and reads are O(n).
Next, indices are introduced. An index is a data structure derived from the data itself. Maintaining an index always incurs additional costs, thus, indices always degrade write performance with the benefit of improving the reads.
One of the simplest possible indices is a hash index. This index is nothing more than a dictionary holding bytes offsets of the records in a database. Continuing previous example, assuming every char is one byte, the hash index would look like this:
Whenever you write data into the database, you also update the index. When you want to read a value for a given key, you could quickly look up an offset in the database file. Having the offset, you could use a "seek" operation to jump straight to the data location. Depending on the particular index implementation you could expect a logarithmic complexity for both reads and writes.
Next, Martin deals with the storage efficiency. Appending data to a database file exhausts disk space quickly. The fewer distinct keys you have — the more inefficient this append-only storage engine is. The solution to this problem is compaction:
When a database file grows to a certain size, you stop appending to it, create a new file (called segment) and redirect all the writes to this new file.
Segments are immutable in that sense that they are never used to append any new data. The only way to modify a segment is to write it's content into a new file, possibly with some transformations in between.
So, the compaction creates new segments containing only the most recent records for each key. Another possible enhancement at this step is merging multiple segments into a single one. Compaction and merging could be done, of course, in background. Old segments are just thrown away.
Every segment, including the one being written to, has its own index. So, when you want to find the value for a given key, you search those indices in reverse chronological order: from the most recent, to the oldest.
So far we have a data structure having these pros:
✔️ Sequential writes are generally faster than random ones
✔️ Concurrency is easy to control having a single writer process
✔️ Crash recovery is easy to implement: just read all the segments sequentially, and store the offsets in the in-memory index
✔️ Merging and compaction help to avoid data fragmentation
However, there are some limitations as well:
❗ Crash recovery could be time-consuming if segments are large and numerous
❗ Hash index must fit in memory. Implementing on-disk hash tables is much more difficult
❗ Range queries (BETWEEN
) are virtually impossible
Now, with this background, let's move to the SSTables and LSM-trees. By the way, these abbreviations mean "Sorted String Tables" and "Log-Structured Merge Trees" accordingly.
SSTables are very similar to the "database" that we've seen previously. The only improvement is that we require records in segments to be sorted by key. This might seem to break the ability to use append-only writes, but that's what LSM-Trees for. We'll see in a moment!
SSTables have some advantages over those simple segments we had previously:
✔️ Merging segments is more efficient due to the records being pre-sorted. All you have to do is to compare segment "heads" on each iteration and choose the lowest one. If multiple segments contain the same key, the value from the most recent segment wins. This compact & merge process also holds the sorting of the keys.
✔️ With keys sorted, you don't need to have every single key in the index anymore. If the key B
is known to be somewhere between keys A
and C
you could just do a scan. This also means that range queries are possible!
The final question is: how do you you get the data sorted by key?
The idea, described by Patrick O’Neil et al. in their "The Log-Structured Merge-Tree (LSM-Tree)", is simple: there are in-memory data structures, such as red-black trees or AVL-trees, that are good at sorting data. So, you split writes into two stages. First, you write the data into the in-memory balanced tree. Second, you flush that tree on the disk. Actually, there may be more than two stages, with deeper ones being bigger and "slower" then the upper (as shown in this answer).
- When a write comes, you add it to the in-memory balanced tree, called memtable.
- When the memtable grows big, it is flushed to the disk. It is already sorted, so it naturally creates an SSTable segment.
- Meanwhile, writes are processed by a fresh memtable.
- Reads are first being looked up in the memtable, then in the segments, starting from the most recent one to the oldest.
- Segments are compacted and merged from time to time in background as described previously.
The scheme is not perfect, it could suffer from sudden crashes: the memtable, being an in-memory data structure, is lost. This issue could be solved by maintaining another append-only file that basically duplicates the contents of the memtable. The database only needs to read it after a crash to re-create the memtable.
And that's it! Note that all the issues of a simple append-only storage, described above, are now solved:
✔️ Now there is only one file to read in a case of a crash: the memtable backup
✔️ Indices could be sparse, thus fitting the RAM is easier
✔️ Range queries are now possible
TLDR: An SSTable is a key-sorted append-only key-value storage. An LSM-tree is a layered data structure, based on a balanced tree, that allows SSTables to exist without the controversy of being both sorted and append-only at the same time.
Congrats, you've finished this long read! If you enjoyed the explanation, make sure to check out the whole book!
Top comments (0)