Thursday, December 5, 2013

Finding the edit distance between two strings

Given two strings as input, write a program to find the edit distance between them. 

The edit distance is defined as the minimum number of modifications to do on one string to transform it into the other string. A modification can be inserting a character, deleting a character, or replacing an existing character.

For example, consider the two strings "saturday" and "sunday". The edit distance is 3. There is no way, we can change one string to the other in less than 3 modifications.

The 3 modifications that can be done on "saturday" are
remove 'a' --> sturday
remove 't' --> surday
replace 'r' with 'n' --> sunday

The following is the algorithm based on dynamic programming.

Let us assume string1 is of length N and string2 is of length M. We create a table of size (N+1) X (M+1).

The entry table[i][j] gives the edit distance between prefix of string1 ending with ith character and prefix of string2 ending with jth character.

0 <= i <= N and 0 <= j <= M

We fill this table in a bottom-up manner. This is essentially solving the problem by combining the results of it's sub-problems. (Dynamic programming). The entries are filled using the following formula

table[0][i] = i
table[i][0] = i

Assume that 0th character is empty, the edit distance between an empty string and a string of length 'i' is always 'i' because by deleting all the 'i' characters, it becomes an empty string.

table[i][j] = table[i-1][j-1] if string1[i] = string2[j]

If string1[i] and string2[j] are equal, we don't need any modifications. Hence it is as same as table[i-1][j-1]

table[i][j] = Min(table[i-1][j],table[i][j-1],table[i-1][j-1]) + 1

If string1[i] and string2[j] are not equal, we have three choices.
  • Modify the differing character string1[i] to match string2[j]. This requires a cost of table[i-1][j-1] + 1
  • Delete string1[i] and try to match the remaining with string2. This requires a cost of table[i-1][j] + 1
  • Insert string1[i] into string2. This requires  a cost of table[i][j-1] + 1.
Finally table[N][M] contains the required answer.

The time and space complexity for this algorithm is O(n * m).
Following is the C++ implementation of the above algorithm.