This is a famous computer science problem with various applications in tools such as file diff, bio informatics etc. For more information read this on wiki.

Given a set of strings, how do we find the

*longest common sub-sequence(LCS)*among them?
For example, LCS{"ATDACG","CTDACGA"}

*is*"TAG".
In this post, we will look at the algorithm to find the

*LCS*of two strings. To understand the solution, let us walk through a simple example.
Let us consider the strings "ACDG", "CAG". We can recursively define the LCS of two strings str1, str2 of lengths n,m as follows.

LCS(str1, str2) =

LCS(str1[0:n-2], str2[0:m-2])+str1[n-1]

*if str1[n-1] = str2[m-1]*
Max(LCS(str1,str2[0:m-2], LCS(str1[0:n-2],str2))

*otherwise*
Using the above definition

LCS("ACDG","CAG") = LCS("ACD", "CA") + "G"

This means that since the last character is same, in the sub-problem we can delete it from both the strings.

Further calculations go as follows

LCS("ACD", "CA") = Max( LCS("AC", "CA"), LCS("ACD","C") )

LCS("ACD", "C") = Max( LCS("AC", "C"), LCS("ACD","") )

and so on...

we can see that LCS("ACD","CA") is "C"; so "CG" is the final output.

If we implement this naive recursion, we repeat the calculation for many sub problems. For efficient implementation we use a 2 dimensional array of size (n+1)*(m+1).The table is calculated using the rules given below which directly follow from the recursive definition given above.

table[i][j]= 0Here is an example table.if either of the string is empty

= 1 + table[i-1][j-1] if their last characters are equal

= max(table[i-1][j], table[i][j-1]) otherwise

∅ B A C B A D

∅ 0 0 0 0 0 0 0

A 0 1 1 1 1 1 1

A 0 1 1 1 1 1 1

B 0 1 1 1 2 2 2

A 0 1 2 2 2 3 3

Z 0 1 2 2 2 3 3

D 0 1 2 2 2 3 4

C 0 1 2 3 3 3 4