Before digging into Emmanuel’s article, I decided to look into other articles regarding hash tables and hash maps. Ozturk here discusses a simple hash map implementation in c++, which I think can potentially be a very good article to start with before getting our hands dirty.
If you are/were a computer science student, you have certainly heard of hash functions, and collisions. The first time I personally have heard of this, I wasn’t sure whether a collision was good or bad (oh well, lol). Bottom line is: we don’t like collisions.
We want to have a hash function close to a bijection, or at least, have outputs uniformly distributed, and we want the computation of hashed values to be efficient and fast.
1- State of the art in hash functions for hash tables:
I’m not sure if It’s interesting to look at the variety of hash functions that are out there, but It’s important to say that for hash tables we’re not really worried about attackers to bother ourselves with cryptographic hash functions so we won’t be using those. Non-cryptographic hash functions are much faster : Bob Jenkins has been developing extensively a bunch of hash functions (see here) and his latest interesting hf is lookup3 (0.5 bytes/cycle), MurmurHash3 is faster than lookup3 (1byte/cycle), CityHash was afterwards released by Google and Jenkins also released SpookyHash (happy hALLoWEeN) which are 2 times faster than MurmurHash3 but have other inconvenients of their own (interesting state of the art study here).
========> In a nutshell, the best ones out there so far are MurmurHash3 (for 32 bit machines) and CityHash-SpookyHash(for 64 bit machines).
2- A deep dive into the hash function implementations in pre-existing Key-Value stores:
GCC offers a hash table library called unordered_map in TR1, and It basically is very similar to Java’s implementation of a hashtable, in the sense that every (key, value) pair
The next attribute here helps with the collision handling: just like Java, if two keys have the same hashcode, we have a linked list structure where we store the pairs with the same hashcode for their keys.
(PS: we store both the key and the value, don’t bother me with the whole “how would i know which value corresponds to my key).
I found this figure in Emmanuel’s blog which describes how the hash table is stored in the heap, and as you can see there are memory spaces called ‘Node’ which means we have space where we store a pointer to the actual (key,value) pair.
Google offers a hash table library called dense_hash_map in SparseHash v2.0.2. This hash table structure is more interesting than the latter since, first of all, we don’t lose a bunch of space just writing addresses. We also don’t use linked list to handle collisions, we use quadratic internal probing in case of collisions, and in terms of locality, it’s better than the linked list structure since in every cache line we have 64 bytes and every pair is 16 bytes long, so we can very likely find the pair we’re looking for on the same line (see quadratic internal probing formula here).
Both of the previous hash table implementations are in-memory, which isn’t what we’re going to be implementing ourselves. But It was still interesting to see how It works.
Now on to something similar to what we’ll be doing, HashDb from Kyoto Cabinet.
So HashDB is one of the data structures KC developed, it’s persistent on-disk and it uses separate chaining through a BST for every bucket. AND, the bucket array has a fixed size, so no matter how bad the load factor is, you can never change that and you’re just going to suffer.
It’s not impossible, but it’s costly, since everything is stored on disk and re-hashing the keys after resizing the bucket array will require access to the disk for all the keys, and once you have so many entries, basically, trying to do this is a death wish. You can store the hashed keys, but that comes down to a space/time compromise, so Kyoto Cabinet just thought “fk all” and fixed the bucket array’s size.
Of course, the BST needs a comparable type to classify the nodes, so they use a function called fold_hash which helps with that (honest to God, I didn’t get it; it’s meant to make these keys ‘comparable’, but the function seems to apply operations on the same hash so it’s kinda like...not doing anything? Anyways, let’s hope Emmanuel answers my question on his blog post about this.
An extra cool thing about HashDB, even though It can be a total pain when we have collisions and we’ll have to find the correct pair..., is that memory management should be handled by KC and not the OS, like It’s the case for dense_hash_map and unordered_map. So, we actually store everything in a file and we also have as to where the free blocks of space are (we have the offset, the size of this space).
These free blocks are allocated using std::set’s upper_bound method, which will return the closest free space possible.
Top comments (0)