đ Updates
June 01, 2023 - Added an implementation with modifiable operation costs. Thanks to @cicirello for the suggestion.
The Levenshtein distance, or the edit distance, is a crucial algorithm in the landscape of computer science, especially pertinent in text analysis and natural language processing (NLP). This mathematical yardstick quantifies the 'distance' between two character strings by computing the least number of operations required to transform one string into the other.
Brief history
Named after Russian scientist Vladimir Levenshtein who introduced it in 1965, the Levenshtein distance calculates the difference between two sequences using insertion, deletion, and substitution operations. The significance of this concept has transcended its original purpose over time, finding applications in a myriad of contexts, ranging from spell-checking algorithms to voice recognition systems.
How it works
The calculation of the Levenshtein distance relies on dynamic programming. It involves constructing an m x n
matrix, where m
and n
denote the lengths of the two strings being compared. Each cell (i, j)
in this matrix encapsulates the Levenshtein distance between the first i
characters of the first string and the first j
characters of the second string.
The algorithm revolves around three elementary operations:
- Insertion: Appending a new character to the string.
- Deletion: Eradicating a character from the string.
- Substitution: Replacing one character with another.
The algorithm computes the minimum distance for each cell, taking into account all potential operations. This computation depends on the values of neighboring cells, exemplifying the dynamic programming approach.
Here is a simple implementation in C#:
public int LevenshteinDistance(string source, string target)
{
ArgumentNullException.ThrowIfNull(source);
ArgumentNullException.ThrowIfNull(target);
var m = source.Length;
var n = target.Length;
var distance = new int[m + 1, n + 1];
for (var i = 0; i <= m; i++)
{
distance[i, 0] = i;
}
for (var j = 0; j <= n; j++)
{
distance[0, j] = j;
}
for (var i = 1; i <= m; i++)
{
for (var j = 1; j <= n; j++)
{
var cost = target[j - 1] == source[i - 1] ? 0 : 1;
var deletion = distance[i - 1, j] + 1;
var insertion = distance[i, j - 1] + 1;
var substitution = distance[i - 1, j - 1] + cost;
distance[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
}
}
return distance[m, n];
}
Time complexity
Given that the Levenshtein distance algorithm operates on an m x n
matrix, the time complexity is O(mn), making it polynomial. This is because each cell in the matrix has to be filled, and the number of cells is proportional to the product of the lengths of the two strings.
Space complexity
Similarly, the space complexity is also O(mn), as the entire matrix needs to be stored in memory. In some variations of the algorithm, the space complexity can be reduced to O(min(m, n)) by maintaining only the current and previous row (or column) in memory.
Other implementations
Let us explore below some implementations which, although similar to the original implementation, have some differences.
đ Using two rows
public int LevenshteinDistance(string source, string target)
{
ArgumentNullException.ThrowIfNull(source);
ArgumentNullException.ThrowIfNull(target);
var m = source.Length;
var n = target.Length;
if (m == 0)
{
return n;
}
if (n == 0)
{
return m;
}
var previousRow = new int[n + 1];
var currentRow = new int[n + 1];
for (var j = 0; j <= n; j++)
{
previousRow[j] = j;
}
for (var i = 1; i <= m; i++)
{
currentRow[0] = i;
for (var j = 1; j <= n; j++)
{
var cost = source[i - 1] == target[j - 1] ? 0 : 1;
var deletion = previousRow[j] + 1;
var insertion = currentRow[j - 1] + 1;
var substitution = previousRow[j - 1] + cost;
currentRow[j] = Math.Min(Math.Min(deletion, insertion), substitution);
}
(previousRow, currentRow) = (currentRow, previousRow);
}
return previousRow[n];
}
In this version, a null check is first performed for the source
and target
strings are performed first, throwing an ArgumentNullException
if either is null
.
The algorithm returns the length of target
or source
if either string length is 0, as the Levenshtein distance is equivalent to the length of the other string.
It then initializes two integer arrays previousRow
and currentRow
, both of size n + 1
, where n
is the length of target
. previousRow
is filled with values from 0 to n.
Iterating over each character in source, currentRow[0]
is set to the current source index. In a nested loop for each character in target
, it calculates the cost of substitution, deletion, insertion, and substitution costs. Then, currentRow[j]
is updated to the minimum of these costs.
At the end of the outer loop, previousRow
and currentRow
are swapped, preparing previousRow
for the next iteration with the updated distances.
After all iterations, it returns the last value in previousRow
, which is the Levenshtein distance between source
and target
.
This algorithm has a space complexity of O(n) and a time complexity of O(mn), where n
and m
are the lengths of the target
and source
strings, respectively.
đ Using one row
public int LevenshteinDistance(string source, string target)
{
ArgumentNullException.ThrowIfNull(source);
ArgumentNullException.ThrowIfNull(target);
if (source.Length < target.Length)
{
(source, target) = (target, source);
}
var m = source.Length;
var n = target.Length;
var currentRow = new int[n + 1];
for (var i = 0; i <= n; i++)
{
currentRow[i] = i;
}
for (var i = 1; i <= m; i++)
{
var previousDiagonal = currentRow[0];
currentRow[0] = i;
for (var j = 1; j <= n; j++)
{
var previousDiagonalCopy = currentRow[j];
var cost = source[i - 1] == target[j - 1] ? 0 : 1;
var deletion = currentRow[j - 1] + 1;
var insertion = currentRow[j] + 1;
var substitution = previousDiagonal + cost;
currentRow[j] = Math.Min(Math.Min(deletion, insertion), substitution);
previousDiagonal = previousDiagonalCopy;
}
}
return currentRow[n];
}
In this version, a null check is first performed for the source
and target
strings are performed first, throwing an ArgumentNullException
if either is null
.
The strings are then swapped if source
is shorter than target
to optimize space complexity.
The algorithm initializes an integer array currentRow
of size n + 1
, which represents the current row in the Levenshtein distance matrix, filled with values from 0 to n.
It iterates over each character in source, saving the first value of currentRow
to previousDiagonal
and updating the first value to the current source index.
In a nested loop, it iterates over each character in target
. It calculates the cost of substitution, and also the deletion and insertion costs. Then, currentRow[j]
is set to the smallest among deletion, insertion, and substitution. previousDiagonal
is updated for the next iteration.
After all iterations, the function returns the last value in currentRow
, representing the Levenshtein distance between source
and target
.
This function has a space complexity of O(n) and a time complexity of O(mn), where n
and m
are the lengths of the target
and source
strings, respectively.
đ With modifiable operation costs
public int LevenshteinDistance(string source, string target, int insertionCost, int deletionCost, int substitutionCost)
{
if (insertionCost < 0)
{
throw new ArgumentException("Value cannot be negative.", nameof(insertionCost));
}
if (deletionCost < 0)
{
throw new ArgumentException("Value cannot be negative.", nameof(deletionCost));
}
if (substitutionCost < 0)
{
throw new ArgumentException("Value cannot be negative.", nameof(substitutionCost));
}
var m = source.Length;
var n = target.Length;
var distance = new int[m + 1, n + 1];
for (var i = 0; i <= m; i++)
{
distance[i, 0] = i * deletionCost;
}
for (var j = 0; j <= n; j++)
{
distance[0, j] = j * insertionCost;
}
for (var i = 1; i <= m; i++)
{
for (var j = 1; j <= n; j++)
{
var cost = source[i - 1] == target[j - 1] ? 0 : substitutionCost;
var deletion = distance[i - 1, j] + deletionCost;
var insertion = distance[i, j - 1] + insertionCost;
var substitution = distance[i - 1, j - 1] + cost;
distance[i, j] = Math.Min(Math.Min(deletion, insertion), substitution);
}
}
return distance[m, n];
}
In this version, the user can define the costs of insertion, deletion, and substitution. Initially, the algorithm checks if these costs are negative, throwing an ArgumentException
exception if they are.
Next, the algorithm calculates the length of the two strings to be compared, called source
and target
. These lengths are stored in the variables m
and n
.
Subsequently, the algorithm creates a two-dimensional matrix, distance
, of size (m + 1) x (n + 1)
to hold the intermediate and final transformation costs.
The algorithm then initializes the first column of the matrix with the progressive deletion costs of the source
string, and the first row with the progressive insertion costs into the target
string.
The next step involves iterating over each character in both strings. For each pair of characters, the algorithm calculates the substitution cost (which is 0 if the characters are identical, otherwise it's the provided substitution cost), the deletion cost, and the insertion cost. The minimum among these three costs is then assigned to the corresponding element in the distance
matrix.
Finally, the bottom-right element of the matrix represents the minimum Levenshtein distance between the two strings. This value is returned as the algorithm's output.
Similarity percentage
To determinate the similarity percentage between two strings leveraging the Levenshtein distance, initially, the greatest potential Levenshtein distance is calculated - equivalent to the length of the longer string. Then, the ratio of the actual Levenshtein distance to this maximum is evaluated.
The formula is this:
In this formula, a similarity of 1.0 denotes identical strings, while a similarity of 0.0 denotes completely distinct strings.
A representation in C#:
public double CalculateSimilarity(string source, string target)
{
var distance = LevenshteinDistance(source, target);
var maxLength = Math.Max(source.Length, target.Length);
return 1.0 - (double)distance / maxLength;
}
This function calculates the Levenshtein distance between the source and target strings, determines the maximum length between the two, and finally computes the similarity percentage based on the aforementioned formula.
Applications
The Levenshtein distance has a broad range of applications:
- Spell-checking: Levenshtein assists in pinpointing words 'closest' to a misspelled word, thereby suggesting the likeliest spelling correction.
- Computational biology: It quantifies the resemblance between two nucleotide strings in the analysis of DNA sequences.
- Speech recognition: In the realm of speech recognition, it's used to compare uttered words with those in a database.
Conclusion
The Levenshtein distance is a potent instrument for gauging the difference between two strings. While the concept might appear simple, its applications are astonishingly diverse and relevant, underscoring the importance of this algorithm in contemporary technology. Consequently, understanding the Levenshtein distance is pivotal for anyone involved in computer science and related fields.
Top comments (11)
There's a version of the algorithm that allows you to specify the costs of the different edit operations. Very similar to original. You can find the details in:
Thank you for informing me. I was not familiar with this version. I will take a look and possibly implement it in the article.
You're welcome
@cicirello if I understand the article correctly, the algorithm is the Damerau-Levenshtein algorithm. It handles asymmetric costs and character transpositions. Am I right?
No, Wagner and Fischer's algorithm doesn't handle transpositions. It is basically like Levenshtein, but instead of the costs of insertions, deletions, and substitutions each being 1, the algorithm has parameters for the costs of each of those operations. You can specify any non-negative cost for deletions, any non-negative cost for insertions, and any non-negative cost for substitutions.
The algorithm is almost identical to the one your post is about. Basically, where you have the
+ 1
in the statement for the deletion cost, you'd instead add whatever the cost of a deletion operation. And then in the statement for the insertion case, you'd add the cost of an insertion instead of adding 1. And in the conditional statement for the case of a substitution, that 1 becomes the cost of a substitution. The rest of algorithm remains the same as what you have.You can then just pass 1 for all 3 costs if you want the Levenshtein case. Or you can pass different costs for the operations if in your application you want to treat the operations differently.
@fabriziobagala after my previous reply, it dawned on me that it might be useful to share my Java implementation of Wagner and Fischer's edit distance algorithm. I have a couple different classes that implements it in a Java library:
Based on what you told me and the code you sent me, I thought of writing this version you suggested in the following way:
What do you think of this implementation?
I think that looks correct.
Great! Then I will include this version in the article. Thank you
Thanks for sharing.
I once used this algorithm (from a NuGet package) to do some fuzzy matching. I was trying to find the most likely thing that a user was typing in from a finite set of options - in case there were any typos.
Great to have a better insight into how it actually works.
It is an algorithm that I particularly like. It can be used in different areas and it is worth knowing how it works đ