Once again I'm deduplicating a repository of files, and trying to do it right. Yeah, I could use a dedicated tool... but I'm not convinced those tools do it correctly.
You see, they hash the file contents and match on that, assuming the digest will be unique for each file. But in reality it doesn't work that way: you're going to get collisions.
How do I know this? I wrote my own deduplication tool.
Right now it uses SHA-256 to generate the hash digests, hashing the entire file. It stores the digests in a MariaDB database and, as it steps through each file, compares them against those of every other file. If two digests collide, it breaks out the venerable cmp tool and verifies that the files really are byte-for-byte duplicates.
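Here's roughly what that looks like, as a minimal sketch rather than the actual tool: it skips the MariaDB layer and keeps the digests in memory, and the names (`sha256_of`, `identical`, `find_duplicates`) are just placeholders.

```python
import hashlib
from collections import defaultdict
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash the entire file contents in chunks, returning the full hex digest."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def identical(a: Path, b: Path, chunk_size: int = 1 << 20) -> bool:
    """Byte-for-byte comparison, the same verdict cmp gives."""
    if a.stat().st_size != b.stat().st_size:
        return False
    with a.open("rb") as fa, b.open("rb") as fb:
        while True:
            ca, cb = fa.read(chunk_size), fb.read(chunk_size)
            if ca != cb:
                return False
            if not ca:  # both files exhausted, no difference found
                return True


def find_duplicates(root: Path):
    """Group files by digest, then confirm each candidate byte-for-byte."""
    by_digest = defaultdict(list)
    for path in root.rglob("*"):
        if path.is_file():
            by_digest[sha256_of(path)].append(path)
    for digest, paths in by_digest.items():
        if len(paths) > 1:
            first = paths[0]
            confirmed = [p for p in paths[1:] if identical(first, p)]
            if confirmed:
                yield first, confirmed
```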
The result?
In a hand-deduped archive of 69248 files, 420 were false positives.
My question now is: Is SHA-256 the right choice? What would be better, or should I be using multiple hash algorithms to reduce that false-positive count? Any speed boost is gravy at this point.
Top comments (1)
It is practically impossible to find even one SHA-256 collision, let alone 420 of them (search for "SHA-256 collision"). So you must be doing something wrong, most likely truncating the digests when you store them in the database. In any case, checking the hash plus the file size should be enough to rule out any practical chance of a false positive.
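For example, something like this would flag truncated digests and match on size plus digest. It's only a sketch, assuming a hypothetical `files` table with `path`, `size`, and `digest` columns and the MariaDB/MySQL Connector/Python client; adjust the names to your schema.

```python
import mysql.connector  # assumed client; any MariaDB connector works the same way

# Hypothetical connection and schema: a `files` table with
# `path`, `size`, and `digest` (full 64-char hex SHA-256) columns.
conn = mysql.connector.connect(
    host="localhost", user="dedup", password="secret", database="dedup"
)
cur = conn.cursor()

# Sanity check: a full SHA-256 hex digest is exactly 64 characters.
# If any rows are shorter, the column (e.g. a VARCHAR(32)) is silently
# truncating the hash, which would explain the "collisions".
cur.execute("SELECT COUNT(*) FROM files WHERE CHAR_LENGTH(digest) <> 64")
(truncated,) = cur.fetchone()
print(f"rows with truncated digests: {truncated}")

# Candidate duplicates: same size AND same full digest.
cur.execute(
    """
    SELECT size, digest, GROUP_CONCAT(path SEPARATOR ', ')
    FROM files
    GROUP BY size, digest
    HAVING COUNT(*) > 1
    """
)
for size, digest, paths in cur.fetchall():
    print(f"{size} bytes  {digest}  ->  {paths}")

cur.close()
conn.close()
```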