Longest Common Subsequences in Bioinformatics

undefined
 
L
o
n
g
e
s
t
 
C
o
m
m
o
n
S
u
b
s
e
q
u
e
n
c
e
s
 
Michael T. Goodrich
University of California, Irvine
 
Application: DNA Sequence Alignment
 
D
N
A
 
s
e
q
u
e
n
c
e
s
 
c
a
n
 
b
e
 
v
i
e
w
e
d
 
a
s
 
s
t
r
i
n
g
s
 
o
f
A
,
 
C
,
 
G
,
 
a
n
d
 
T
 
c
h
a
r
a
c
t
e
r
s
,
 
w
h
i
c
h
 
r
e
p
r
e
s
e
n
t
n
u
c
l
e
o
t
i
d
e
s
.
Finding the similarities between two DNA
sequences is an important computation
performed in bioinformatics.
For instance, when comparing the DNA of different
organisms, such alignments can highlight the
locations where those organisms have identical
DNA patterns.
 
Application: DNA Sequence Alignment
 
Finding the best alignment between two DNA strings
involves minimizing the number of changes to convert
one string to the other.
 
 
 
 
 
 
A
 
b
r
u
t
e
-
f
o
r
c
e
 
s
e
a
r
c
h
 
w
o
u
l
d
 
t
a
k
e
 
e
x
p
o
n
e
n
t
i
a
l
 
t
i
m
e
,
 
b
u
t
w
e
 
c
a
n
 
d
o
 
m
u
c
h
 
b
e
t
t
e
r
 
u
s
i
n
g
 
d
y
n
a
m
i
c
 
p
r
o
g
r
a
m
m
i
n
g
.
 
Dynamic Programming
 
Applies to a problem that at first seems to
require a lot of time (possibly exponential),
provided we have:
Simple subproblems:
 the subproblems can be
defined in terms of a few variables, such as j, k, l, m,
and so on.
Subproblem optimality:
 the global optimum value
can be defined in terms of optimal subproblems
Subproblem overlap:
 the subproblems are not
independent, but instead they overlap (hence,
should be constructed bottom-up).
 
Subsequences
 
A 
subsequence
 of a character string
x
0
x
1
x
2
…x
n-1
 is a string of the form x
i
1
x
i
2
…x
i
k
,
where i
j
 < i
j+1
.
Not the same as substring!
Example String: ABCDEFGHIJK
Subsequence: ACEGJIK
Subsequence: DFGHK
Not subsequence: DAGH
 
The Longest Common Subsequence
(LCS) Problem
 
Given two strings X and Y, the longest
common subsequence (LCS) problem is to
find a longest subsequence common to
both X and Y
Has applications to DNA similarity testing
(alphabet is {A,C,G,T})
Example: ABCDEFG and XZACKDFWGH
have ACDFG as a longest common
subsequence
 
A Poor Approach to the LCS Problem
 
A Brute-force solution:
Enumerate all subsequences of X
Test which ones are also subsequences of Y
Pick the longest one.
Analysis:
If X is of length n, then it has 2
n
subsequences
This is an exponential-time algorithm!
 
LCS
 
8
 
A Dynamic-Programming Approach
to the LCS Problem
 
Define L[i,j] to be the length of the longest common
subsequence of X[0..i] and Y[0..j].
Allow for -1 as an index, so L[-1,k] = 0 and L[k,-1]=0, to
indicate that the null part of X or Y has no match with the other.
Then we can define L[i,j] in the general case as follows:
1.
If xi=yj, then L[i,j] = L[i-1,j-1] + 1 (we can add this match)
2.
If xi
≠yj, then L[i,j] = max{L[i-1,j], L[i,j-1]} (we have no
match here)
 
Case 1:
 
Case 2:
 
An LCS Algorithm
 
Algorithm 
LCS(X,Y ):
Input:
 
Strings X and Y with n and m elements, respectively
Output: 
For i = 0,…,n-1, j = 0,...,m-1, the length L[i, j] of a longest string
that is a subsequence of both the string X[0..i] = x
0
x
1
x
2
…x
i  
and the
string Y [0.. j] = y
0
y
1
y
2
…y
j
for 
i =1 to n-1 
do
 
L[i,-1] = 0
for 
j =0 to m-1 
do
 
L[-1,j] = 0
for 
i =0 to n-1 
do
 
for 
j =0 to m-1 
do
  
if 
x
i
 = y
j
  
then
   
L[i, j] = L[i-1, j-1] + 1
  
else
   
L[i, j] = max{L[i-1, j] , L[i, j-1]}
return 
array L
 
Visualizing the LCS Algorithm
 
Analysis of LCS Algorithm
 
We have two nested loops
The outer one iterates 
n
 times
The inner one iterates 
m
 times
Thus, the total running time is O(
nm
)
For example, if n = 1 million and m = 1 million, and we
can do ops in 1 ns, then we need approx. 17 minutes.
Answer is contained in L[n,m] (and the subsequence
can be recovered from the L table).
This takes O(nm) space, however!
For example, if n = 1 million and m = 1 million, then we
need over 1 TB of memory for the L table!
 
Hirschberg’s Algorithm
 
Hirschberg’s algorithm is
a way of computing the
LCS in linear space.
Dan Hirschberg is a
faculty member in the
Department of Computer
Science at University of
California, Irvine.
He teaches Algorithms
and Data Compression
in the MCS program.
 
Idea #1
 
When we are computing the L table, we only need
to store the previous row and the current row.
If we just want to know the last row of the LCS
(e.g., to compute the length of the LCS), this is
enough to achieve linear space.
 
Review of the LCS Algorithm
 
To match Hirschberg’s notation, let A be string of
length m and B be string of length n (both indexed
starting from 1).
 
Applying Idea #1
 
Use a 2 x n matrix, K, instead of L.
Let A be string of length m and B be string of
length n (indexed starting from 1). The following
algorithm gives the last line, LL, of the L table:
 
Idea #2: Divide and Conquer Algorithm
 
Where does the LCS path cross the middle
column of the L table?
 
 
 
 
 
For i=m/2, find the place, k, where the LCS path
crosses the middle column using Algorithm B.
Then recursively find the LCS for A[1,i] and B[1,k]
and the LCS for A[i+1,m] and B[k+1,n].
 
Notation
 
Algorithm Analysis
 
Space is O(n + m).
Time: T(m,n) = T(m/2, j) + T(m/2, n-j) + cnm
 
Prove by induction that
T(m,n) <= 2cmn
 
In class…
Slide Note
Embed
Share

DNA Sequence Alignment is a crucial task in bioinformatics, where dynamic programming helps in finding the best alignment between DNA strings efficiently. The Longest Common Subsequence (LCS) problem aims to discover the longest shared subsequence between two strings, offering applications in DNA similarity testing. While a brute-force approach to solving the LCS problem is possible, it can be highly inefficient due to its exponential time complexity.

  • Bioinformatics
  • DNA Sequences
  • Longest Common Subsequence
  • Dynamic Programming

Uploaded on May 10, 2024 | 1 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Longest Common Subsequences Michael T. Goodrich University of California, Irvine

  2. Application: DNA Sequence Alignment DNA sequences can be viewed as strings of A, C, G, and T characters, which represent nucleotides. Finding the similarities between two DNA sequences is an important computation performed in bioinformatics. For instance, when comparing the DNA of different organisms, such alignments can highlight the locations where those organisms have identical DNA patterns.

  3. Application: DNA Sequence Alignment Finding the best alignment between two DNA strings involves minimizing the number of changes to convert one string to the other. A brute-force search would take exponential time, but we can do much better using dynamic programming.

  4. Dynamic Programming Applies to a problem that at first seems to require a lot of time (possibly exponential), provided we have: Simple subproblems: the subproblems can be defined in terms of a few variables, such as j, k, l, m, and so on. Subproblem optimality: the global optimum value can be defined in terms of optimal subproblems Subproblem overlap: the subproblems are not independent, but instead they overlap (hence, should be constructed bottom-up).

  5. Subsequences A subsequence of a character string x0x1x2 xn-1is a string of the form xi1xi2 xik, where ij < ij+1. Not the same as substring! Example String: ABCDEFGHIJK Subsequence: ACEGJIK Subsequence: DFGHK Not subsequence: DAGH

  6. The Longest Common Subsequence (LCS) Problem Given two strings X and Y, the longest common subsequence (LCS) problem is to find a longest subsequence common to both X and Y Has applications to DNA similarity testing (alphabet is {A,C,G,T}) Example: ABCDEFG and XZACKDFWGH have ACDFG as a longest common subsequence

  7. A Poor Approach to the LCS Problem A Brute-force solution: Enumerate all subsequences of X Test which ones are also subsequences of Y Pick the longest one. Analysis: If X is of length n, then it has 2n subsequences This is an exponential-time algorithm!

  8. LCS 8 A Dynamic-Programming Approach to the LCS Problem Define L[i,j] to be the length of the longest common subsequence of X[0..i] and Y[0..j]. Allow for -1 as an index, so L[-1,k] = 0 and L[k,-1]=0, to indicate that the null part of X or Y has no match with the other. Then we can define L[i,j] in the general case as follows: 1. If xi=yj, then L[i,j] = L[i-1,j-1] + 1 (we can add this match) 2. If xi yj, then L[i,j] = max{L[i-1,j], L[i,j-1]} (we have no match here) Case 1: Case 2:

  9. An LCS Algorithm Algorithm LCS(X,Y ): Input: Strings X and Y with n and m elements, respectively Output: For i = 0, ,n-1, j = 0,...,m-1, the length L[i, j] of a longest string that is a subsequence of both the string X[0..i] = x0x1x2 xi and the string Y [0.. j] = y0y1y2 yj for i =1 to n-1 do L[i,-1] = 0 for j =0 to m-1 do L[-1,j] = 0 for i =0 to n-1 do for j =0 to m-1 do if xi = yjthen L[i, j] = L[i-1, j-1] + 1 else L[i, j] = max{L[i-1, j] , L[i, j-1]} return array L

  10. Visualizing the LCS Algorithm

  11. Analysis of LCS Algorithm We have two nested loops The outer one iterates n times The inner one iterates m times Thus, the total running time is O(nm) For example, if n = 1 million and m = 1 million, and we can do ops in 1 ns, then we need approx. 17 minutes. Answer is contained in L[n,m] (and the subsequence can be recovered from the L table). This takes O(nm) space, however! For example, if n = 1 million and m = 1 million, then we need over 1 TB of memory for the L table!

  12. Hirschbergs Algorithm Hirschberg s algorithm is a way of computing the LCS in linear space. Dan Hirschberg is a faculty member in the Department of Computer Science at University of California, Irvine. He teaches Algorithms and Data Compression in the MCS program.

  13. Idea #1 When we are computing the L table, we only need to store the previous row and the current row. If we just want to know the last row of the LCS (e.g., to compute the length of the LCS), this is enough to achieve linear space.

  14. Review of the LCS Algorithm To match Hirschberg s notation, let A be string of length m and B be string of length n (both indexed starting from 1).

  15. Applying Idea #1 Use a 2 x n matrix, K, instead of L. Let A be string of length m and B be string of length n (indexed starting from 1). The following algorithm gives the last line, LL, of the L table:

  16. Idea #2: Divide and Conquer Algorithm Where does the LCS path cross the middle column of the L table? For i=m/2, find the place, k, where the LCS path crosses the middle column using Algorithm B. Then recursively find the LCS for A[1,i] and B[1,k] and the LCS for A[i+1,m] and B[k+1,n].

  17. Notation

  18. Algorithm Analysis Space is O(n + m). Time: T(m,n) = T(m/2, j) + T(m/2, n-j) + cnm

  19. Prove by induction that T(m,n) <= 2cmn In class

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#