The hash tables are great when we need to search a string in a big set of strings.
They provide O(1) complexity this way making our algorithms very fast.
Unfortunately, the hash tables have one big disadvantage, especially when we are talking about huge sets of strings.
Because it is important to keep the hash collisions as low as possible, the size of the hash table must be at least twice the count of the strings we need to put inside.
This way, on a very big set of strings, the memory waste can be really huge. And if we fill the table more, the speed of the search degrades very fast down to simple linear search with O(n).
There are different ways to counteract these problems. For one of them I am writing here.
The author of the algorithm is Tomasz Grysztar, the author of FlatAssembler.
FlatAssembler uses this algorithm in order to handle huge number of label names during compilation.
The algorithm is slower than the classic hash table, but still will keep the complexity to O(log k) where k is the size of the hash, not the size of the set.
This way, the search time will remain constant regardless of the strings count in the set and can be assumed to be O(1).
Considering the much lower number of collisions, on large sets it can be faster than a classic hash table with smaller hash size.
Adding strings to the tree.
At first we need to compute the hash of the string. It is good idea to use relatively big hash, for example 32bit, in order to minimize the collisions.
I am usually using the FNV-1b hash, but the hash algorithm is not very important for the understanding of the algorithm.
Then we need a linear memory area, where elements of the following type will be allocated consecutively:
struct THashTreeNode
.bit0 dd ?
.bit1 dd ?
ends
Then starting from the first array element and shifting the hash one bit right at a time, we put in the .bit0
or .bit1
fields (depending on the value of the bit shifted) the offset to the next element of the array where to continue the scan.
The next element is allocated at the end of the array, just after the last used element. (like in a queue or linear list).
When we reach the 32nd bit of the hash, the last index will point to the element with the payload (or the leaf of the tree if you prefer) - it can occupy random space, but it is more convenient if the size of the leaf structure is multiple of the size of the node structure.
In my implementation it occupies two consecutive elements of of the array:
struct THashTreeLeaf
.hString dd ? ; a handle of the string stored.
.next dd ? ; the next element in the cases of collisions.
.lparam dd ? ; user defined parameter.
.wparam dd ? ; user defined parameter.
ends
The field .hString
contains a pointer or a handle to the string we put in the tree. (in the library I use, the strings are dynamic and are represented by a handles, not pointers. But the difference is not important)
The field .next
contains an offset to the next THashTreeLeaf
in the rare cases if we have two different strings with the same hash (a collision).
Searching the string in the set.
The searching algorithm is almost the same as when we are adding strings to the set.
Starting from the beginning of the array, we are shifting the searched string hash one bit at a time and jumping to the offsets from the .bit0
and .bit1
fields. If on some step the offset is 0 this means the string is not in the tree.
If after 32 steps, we find a leaf element, a direct string comparison with the stored string is needed in order to resolve the possible collision.
Then if the strings does not match, we need to follow the offsets from the THashTreeLeaf.next
fields until 0 (the string is not in the tree) or string match - the string is in the tree.
You can see that in the later case, the search deteriorates to a simple and slow linked list search. Fortunately, when the hash is 32 bit such collisions are really rare.
Advantages
The algorithm is memory friendly.
The whole tree is in one contiguous memory area, which is good for the CPU caches and additionally improve the performance.
The algorithm is relatively simple, at least for assembly language implementation.
I am not sure how it will be implemented in higher level languages, but at least for C/C++/Pascal there should be no problems, except for some possible code readability issues...
Notice, that the same algorithm can be implemented as an usual tree with separately allocated nodes and leafs. But such implementation will be slower, because of the heap allocations of multiply small chunks of memory and because of the cache issues.
The quiz
In order to make the things even more clear, here is the full code of the procedure that can add strings to the tree and also to search the tree for a string without adding it.
Because the searching and adding algorithms are very similar, it is possible to provide both functions in a single procedure.
As a little quiz to the readers I will not put detailed line-by-line explanations of the code. It is published in the same form it exists in the production library.
I will give only a little reference to the external procedures used at the end of the article.
So, that is all. Here is the code. Try to analyze and understand it! :)
If something is not clear ask in the comments.
I will explain it in details, but let give some fun time for the puzzle solvers.
; Searches the hash tree for the given string.
; if [.fAdd] is TRUE and if the string is not in the tree it will be added.
;
; Arguments:
; .pHashTree - pointer to TArray with elements size sizeof.THashTreeNode (8 bytes)
; .hString - handle/pointer to the string to be searched/added. Notice, that exactly this
; string will be added to the THashTreeLeaf element. The caller is responsible
; to not free this string until the hash tree is needed.
; .fAdd - flag indicating whether to add the string to the tree if is not included.
;
; Returns:
; edx - pointer to the array. Can be reallocated!
; eax - index in the TArray of the THashTreeLeaf of the string.
; CF = 1: the string is already in the tree.
; CF = 0: the string is not in the tree.
; [.fAdd] <> 0: eax contains a pointer to the new added leaf.
; eax == 0 means there is an out of memory error on
; the attempt to add the string.
; [.fAdd] = 0: eax == 0
proc SearchHashTree, .pHashTree, .hString, .fAdd
.hash dd ?
begin
pushad
mov edx, [.pHashTree]
stdcall StrHash, [.hString]
mov [.hash], eax
mov ecx, 32
mov ebx, TArray.array
cmp [edx+TArray.count], 0
jne .treeseek
stdcall AddArrayItems, edx, 1
and dword [eax+THashTreeNode.bit0], 0
and dword [eax+THashTreeNode.bit1], 0
.treeseek:
xor edi, edi
ror [.hash], 1
adc edi, edi
lea edi, [ebx+4*edi]
cmp dword [edx + edi], 0 ; edi is offset from the beginning
; of the array.
je .notfound
mov ebx, [edx + edi]
loop .treeseek
; The hash exists in the hash tree. Process the possible collisions.
; here ebx is the offset to the THashTreeLeaf !
.cmp_loop:
stdcall StrCompCase, [.hString], [edx+ebx+THashTreeLeaf.hString]
jc .finish ; the string is already in the hash tree!!!
lea edi, [ebx + THashTreeLeaf.next]
cmp dword [edx + edi], 0
je .add_in_list
mov ebx, [edx + edi]
jmp .cmp_loop
.add_in_list:
cmp [.fAdd], 0
je .finish_zero
stdcall AddArrayItems, edx, 2
jc .finish_zero
sub eax, edx
jmp .do_add
.notfound:
cmp [.fAdd], 0
je .finish_zero
; edx - pointer to the tree array, ebx - offset of the last
; found element.
; eax - 0 or 1 depending of the last bit checked.
; ecx - remaining bits count.
; edi - the offset to the [.bitX] field of the last node.
.add_remaining:
lea eax, [ecx+1]
stdcall AddArrayItems, edx, eax ; add all needed (THashTreeNode) + THashTreeLeaf
jc .finish_zero
sub eax, edx
dec ecx
jz .do_add
.addloop:
mov [edx + edi], eax
and dword [edx + eax + THashTreeNode.bit0], 0
and dword [edx + eax + THashTreeNode.bit1], 0
xor edi, edi
ror [.hash], 1
adc edi, edi
lea edi, [eax + 4*edi]
add eax, sizeof.THashTreeNode
loop .addloop
.do_add:
mov [edx + edi], eax
lea ebx, [edx+eax]
mov ecx, [.hString]
xor eax, eax
mov [ebx+THashTreeLeaf.hString], ecx
mov [ebx+THashTreeLeaf.next], eax
mov [ebx+THashTreeLeaf.lparam], eax
mov [ebx+THashTreeLeaf.wparam], eax
clc
.finish:
mov [esp+4*regEAX], ebx
mov [esp+4*regEDX], edx
popad
return
.finish_zero:
xor ebx, ebx
clc
jmp .finish
endp
Very small cheat sheet:
TArray
is a structure of a dynamic array that can be resized when needed. (the source)
The procedure AddArrayElements
adds new elements at the end of TArray and returns in EAX
a pointer to these new elements and in EDX
the pointer to the array that can be reallocated because of the resizing.
The procedure StrHash
computes 32bit hash of a string and returns it in EAX
The procedure StrCompCase
compares two strings, case sensitive and returns Boolean result in CF flag. 1 if the string match and 0 if not.
Quick assembly instruction reference for the absolute beginners: here
NB: The above implementation is far from optimal. It is written in order to be readable and correct. The optimizations will be made later. The bug fixes as well.
Top comments (0)