Sup dudes! 😎
Today's story is a bit unusual but surely is interesting. I was learning about binary search trees in the Coursera when the professor comes up with the story when a bug was found after 9 years of using binary search implementation on Java in 2006. It was only found when someone was using large values of the int variables low and high. Let me explain.
First things first! 🤓
What is a binary search tree?
According to Wikipedia, a binary search tree (BST) is
[...] A particular type of container: a data structure that stores "items" in memory. They allow fast lookup, addition, and removal of items, and can be used to implement either dynamic sets of items or lookup tables that allow finding an item by its key. They keep their keys in sorted order so that lookup and other operations can use the principle of binary search: when looking for a key in a tree, they traverse the tree from root to leaf, making comparisons to keys stored in the nods of the tree and deciding, based on the comparison, to continue searching in the left or right subtrees. This means, that on average each comparison allows the operations to skip about half of the tree.
Ok, but can we summarize? Of course! Imagine a tree where every element is sorted, to search in it we need to check, from the root, if the element is less or greater than the start. If it's less, we go left, if it's greater we go right.
Let's use the example in the gif. We need to search the number 27; 21 is the root, so let's start there.
- Is 27 greater or equal than 21? Yes, so go right.
- Is 27 greater or equal than 28? No, so go left.
- Is 27 greater or equal than 25? No, so go right.
- Is 27 greater or equal than 27? Yes, we found it!
The gif also shows a comparison to a sorted array search. We see that BST is much faster!
Ok, now let's see the implementation of the BST
public static int binarySearch(int[] a, int key) {
int low = 0;
int high = a.length - 1;
while (low <= high) {
int mid = (low + high) / 2;
int midVal = a[mid];
if (midVal < key)
low = mid + 1;
else if (midVal > key)
high = mid - 1;
else
return mid; // key found
}
return -(low + 1); // key not found.
}
This a standard binary search in Java, this was written in the java.util.Arrays.
Looking from different angles is hard to find the bug, actually, it's correct, but not actually... Ok, let me explain...
The bug 👨💻
The bug is in the mathematical details of this line:
int mid =(low + high) / 2;
But why?
Well, according to Joshua Bloch in his article,
In Programming Pearls Bentley says that the analogous line "sets m to the average of l and u, truncated down to the nearest integer." On the face of it, this assertion might appear correct, but it fails for large values of the int variables low and high. Specifically, it fails if the sum of low and high is greater than the maximum positive int value (231 - 1). The sum overflows to a negative value, and the value stays negative when divided by two. In C this causes an array index out of bounds with unpredictable results. In Java, it throws ArrayIndexOutOfBoundsException.
This bug can manifest itself for arrays whose length (in elements) is 230 or greater (roughly a billion elements). This was inconceivable back in the '80s, when Programming Pearls was written, but it is common these days at Google and other places. In Programming Pearls, Bentley says "While the first binary search was published in 1946, the first binary search that works correctly for all values of n did not appear until 1962." The truth is, very few correct versions have ever been published, at least in mainstream programming languages.
TLDR: The int data type has a limit, when summing two extraordinary big values we get an overflow and it has a negative value, then the value stays negative when divided by two causing the bug.
The correct way 👍
Now, let's see the correct way to fix it:
int mid = low + ((high - low) / 2);
this now prevents the overflow and (at least for now) makes it bug-free. At least, we strongly suspect it's bug-free.
Now, after this we can feel better for getting this wrong on college, even the greatest can fail! 😂
See you later!
References:
https://www.mathwarehouse.com/programming/gifs/binary-search-tree.php
https://en.wikipedia.org/wiki/Binary_search_tree
https://ai.googleblog.com/2006/06/extra-extra-read-all-about-it-nearly.html
Top comments (8)
I've seen it straight away. This has been discussed recently quite a lot, I'm surprised an Enterprise programming language has such naive bugs.
Oh well, I guess it's Java, so I'm not THAT surprised.
On point my dude! 🤣🔥
Another way to find the mid: int mid = (low + high) >>> 1.
Source: ai.googleblog.com/2006/06/extra-ex...
Line no. 10, the implementation of BST, you forgot to add a ;
:p
Oh gosh! Thanks for letting me know!
why dont we just have int mid = low/2 + high/2;
Because that's wrong :D
low = 3;
high = 5;
int mid = 3/2 + 5/2 is actually 1 + 2 = 3 because division returns integers. And that gives the wrong result.
int mid =(low + high) / 2 = (3+5) / 2 = 8 / 2 = 4. This is the correct mid.
The "sorted array search" example is the behavior of O(n) search in an unsorted array.
You're already applying binary search to a sorted array by using the indices.