DEV Community

Levenshtein Distance

Fabrizio BagalĂ  on May 26, 2023

🔄 Updates June 01, 2023 - Added an implementation with modifiable operation costs. Thanks to @cicirello for the suggestion. The Levenshtein dist...
Collapse
 
cicirello profile image
Vincent A. Cicirello

There's a version of the algorithm that allows you to specify the costs of the different edit operations. Very similar to original. You can find the details in:

R. A. Wagner and M. J. Fischer, "The string-to-string correction problem," Journal of the ACM, vol. 21, no. 1, pp. 168–173, January 1974.

Collapse
 
fabriziobagala profile image
Fabrizio BagalĂ 

Thank you for informing me. I was not familiar with this version. I will take a look and possibly implement it in the article.

Collapse
 
cicirello profile image
Vincent A. Cicirello

You're welcome

Thread Thread
 
fabriziobagala profile image
Fabrizio BagalĂ 

@cicirello if I understand the article correctly, the algorithm is the Damerau-Levenshtein algorithm. It handles asymmetric costs and character transpositions. Am I right?

Thread Thread
 
cicirello profile image
Vincent A. Cicirello

No, Wagner and Fischer's algorithm doesn't handle transpositions. It is basically like Levenshtein, but instead of the costs of insertions, deletions, and substitutions each being 1, the algorithm has parameters for the costs of each of those operations. You can specify any non-negative cost for deletions, any non-negative cost for insertions, and any non-negative cost for substitutions.

The algorithm is almost identical to the one your post is about. Basically, where you have the + 1 in the statement for the deletion cost, you'd instead add whatever the cost of a deletion operation. And then in the statement for the insertion case, you'd add the cost of an insertion instead of adding 1. And in the conditional statement for the case of a substitution, that 1 becomes the cost of a substitution. The rest of algorithm remains the same as what you have.

You can then just pass 1 for all 3 costs if you want the Levenshtein case. Or you can pass different costs for the operations if in your application you want to treat the operations differently.

Thread Thread
 
cicirello profile image
Vincent A. Cicirello • Edited

@fabriziobagala after my previous reply, it dawned on me that it might be useful to share my Java implementation of Wagner and Fischer's edit distance algorithm. I have a couple different classes that implements it in a Java library:

  • This one has methods for computing the edit distance between arrays of various types, between Strings, between lists, etc, and uses ints for the three cost values: github.com/cicirello/JavaPermutati...
  • This one is just like the above, but uses doubles for the three cost parameters: github.com/cicirello/JavaPermutati...
  • This last one is for the special case of computing the distance between permutations. The algorithm is the same really. It just predates the other implementations in this library, from when the library was entirely focused on permutations: github.com/cicirello/JavaPermutati...
Thread Thread
 
fabriziobagala profile image
Fabrizio BagalĂ 

Based on what you told me and the code you sent me, I thought of writing this version you suggested in the following way:

public int LevenshteinDistance(string source, string target, int insertionCost, int deletionCost, int substitutionCost)
{
    if (insertionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(insertionCost));
    }

    if (deletionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(deletionCost));
    }

    if (substitutionCost < 0)
    {
        throw new ArgumentException("Value cannot be negative.", nameof(substitutionCost));
    }

    var m = source.Length;
    var n = target.Length;

    var distance = new int[m + 1, n + 1];

    for (var i = 0; i <= m; i++)
    {
        distance[i, 0] = i * deletionCost;
    }

    for (var j = 0; j <= n; j++)
    {
        distance[0, j] = j * insertionCost;
    }

    for (var i = 1; i <= m; i++)
    {
        for (var j = 1; j <= n; j++)
        {
            var cost = source[i - 1] == target[j - 1] ? 0 : substitutionCost;

            var deletion = distance[i - 1, j] + deletionCost;
            var insertion = distance[i, j - 1] + insertionCost;
            var substitution = distance[i - 1, j - 1] + cost;

            distance[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
        }
    }

    return distance[m, n];
}
Enter fullscreen mode Exit fullscreen mode

What do you think of this implementation?

Thread Thread
 
cicirello profile image
Vincent A. Cicirello

I think that looks correct.

Thread Thread
 
fabriziobagala profile image
Fabrizio BagalĂ 

Great! Then I will include this version in the article. Thank you

Collapse
 
ant_f_dev profile image
Anthony Fung

Thanks for sharing.

I once used this algorithm (from a NuGet package) to do some fuzzy matching. I was trying to find the most likely thing that a user was typing in from a finite set of options - in case there were any typos.

Great to have a better insight into how it actually works.

Collapse
 
fabriziobagala profile image
Fabrizio BagalĂ 

It is an algorithm that I particularly like. It can be used in different areas and it is worth knowing how it works 😉