Multiple Sequence Alignment Methods and Motivation

 
Mulitiple
Sequence
Alignment
 
Lecture – 6
 
Department of CSE, DIU
 
CONTENTS
 
1.
Multiple Sequence Alignment (MSA)
                          - Why Multiple Sequence Alignment
 
- Multiple Sequence Alignment: Motivation
 
- MSA Challenges
2.
Multiple Sequence Alignment Methods
 
- Dynamic Method: Impractical
 
-Greedy Method
 
- Progressive Method
 
- Iterative Method
 
1. Multiple Sequence Alignment
(MSA)
 
Why Multiple sequence, Motivation, Challanges
 
Multiple Sequence Alignment
 
A multiple sequence alignment (MSA) is a sequence
alignment of three or more biological sequences, generally
protein, DNA, or RNA. In many cases, the input set of query
sequences are assumed to have an evolutionary relationship
by which they share a linkage and are descended from a
common ancestor.
 
V
T
 
I
 
S
 
C
 
T
 
G
 
S
 
S
 
S
 
N
 
I
 
G
V
T
 
L
T
 
C
 
T
 
G
 
S
 
S
 
S
 
N
 
I
 
G
V
T
 
L
S
 
C
 
S
 
S
 
S
 
G
 
F
 
I
 
 
F
 
S
V
T
 
L
T
 
C
 
T
 
V
 
S
 
G
 
T
 
S
 
F
 
D
V
T
 
I
 
T
 
C
 
V
 
V
 
S
 
D
 
V
 
S
 
H
 
E
V
T
 
L
V
 
C
 
L
 
 
I
 
S
 
D
 
F
 
Y
 
P
 
G
V
T
 
L
V
 
C
 
L
 
 
I
 
S
 
D
 
F
 
Y
 
P
 
G
V
T
 
L
V
 
C
 
L
 
V
S
 
D
 
Y
 
F
 
P
 
E
 
Up until now we have only tried to align two sequences.
What about more than two? And what for?
A faint similarity between two sequences becomes significant if
present in many
Multiple alignments can reveal subtle similarities that pairwise
alignments do not reveal
 
V
T
 
I
 
S
 
C
 
T
 
G
 
S
 
S
 
S
 
N
 
I
 
G
V
T
 
L
T
 
C
 
T
 
G
 
S
 
S
 
S
 
N
 
I
 
G
V
T
 
L
S
 
C
 
S
 
S
 
S
 
G
 
F
 
I
 
 
F
 
S
V
T
 
L
T
 
C
 
T
 
V
 
S
 
G
 
T
 
S
 
F
 
D
V
T
 
I
 
T
 
C
 
V
 
V
 
S
 
D
 
V
 
S
 
H
 
E
V
T
 
L
V
 
C
 
L
 
 
I
 
S
 
D
 
F
 
Y
 
P
 
G
V
T
 
L
V
 
C
 
L
 
 
I
 
S
 
D
 
F
 
Y
 
P
 
G
V
T
 
L
V
 
C
 
L
 
V
S
 
D
 
Y
 
F
 
P
 
E
 
 
Why Multiple Sequence
alignment
 
Multiple
 
Sequenc
e
Alignment
:
 
Motivation
 
C
orrespondence
.
 
Find
 ou
t 
whic
h 
part
s “do
 
the
 
same
 
thing”
Simila
r 
gene
s 
ar
e conserved 
acros
s 
widel
y 
divergen
t
 
species, 
ofte
n
 
performin
g
 
similar functions
Structure
 
prediction
U
se knowledge
 
of structure of one or more members of a protein MSA to predict
 
structure of
other members
Structure
 
is more conserved than
 
sequence
Create
 
“profiles”
 
for
 
protein families
A
llow us to search for other members of the family
G
enom
e 
assembly
:
 
Automated reconstruction 
o
f
 “contig” maps
 
o
f
 
genomi
c fragments
such
 a
s 
ESTs
m
sa
 i
s the
 
starting
 poin
t 
for
 
phylogenetic
 
analysis
m
sa often allows 
to
 
detect weakly conserved
 
regions
 
which pairwise
 
alignment can’t
 
MSA Challanges
 
Computationally Expensive
If msa includes matches, mismatches and gaps and also
accounts the degree of variation then global msa can be
applied to only a few sequences
Difficult to score
Multiple comparison necessary in each column of the msa for a
cumulative score
Placement of gaps and scoring of substitution is more difficult
Difficulty increases with diversity
Relatively easy for a set of closely related sequences
Identifying the correct ancestry relationships for a set of distantly
related sequences is more challenging
Even difficult if some members are more alike compared to others
 
2. Multiple Sequence Alignment
    Methods
 
Dynamic Programming, Greedy, Progressive, Iterative
 
Global
 
MSA:
 
Dynamic
 
Programming
 
The two-sequence alignment algorithm (Needleman- Wunsch) can be
generalized to any number of sequences.
E.g., for three sequences X, Y, W define C[i,j,k] = score
of optimum alignment
among
 
X[1..i], Y[1..j], W[1..k]
As for two sequences, divide possible alignments into different classes,
depending on how they end.
Devise recurrence relations for C[i,j,k]
C[i,j,k] is the maximum out of all possibilities
 
MSA for 3 sequences: alignment can end in 7 ways
 
X
i
 
Y
j
W
k
 
X
i-1
Y
j-1
W
k-1
 
X
i
Y
j
W
k
-
Y
j
W
k
X
i
-
W
k
X
i
Y
j
-
-
-
W
k
-
Y
j
-
X
i
-
-
 
Alignin
g
 
Three
 Sequences
 
Same 
strategy as
aligning
 
two
 
sequences
Use 
a 3-D “Manhattan
Cube”,
 
with
 
each
 
axis
representing
 
a
 
sequence
to
 
align
 
V
 
W
 
2-D edit graph
 
3-D edit graph
 
V
 
W
 
X
 
Dynamic
 
programming 
fo
r
 
3
sequences
 
Each alignment is a path through the dynamic programming
matrix
 
V
 
S
 
N
 
 
S
S
 
N
 
A
 
 
 
A
 
S
 
V
 
S
 
N
 
S
 
A
N
S
 
Start
 
2-
D 
cell
 
versu
s
 
3
-
D 
Alignmen
t
 
Cell
 
I
n
 
3
-
D
,
 
7
 
e
d
g
e
s
i
n
 
e
a
c
h
 
u
n
i
t
 
c
u
b
e
 
I
n
 
2
-
D
,
 
3
 
e
d
g
e
s
i
n
 
e
a
c
h
 
u
n
i
t
s
q
u
a
r
e
 
C(i-1,j-1,k-1)
 
C(i-1,j,k-1)
 
C(i,j-1,k)
 
C(i-1,j-1,k)
 
C (i-1,j,k)
 
C(i,j,k)
 
C(i,j,k-1)
 
C(i,j-1,k-1)
 
Enumerate
 
all
 
possibilities
 
and
 
choose
 
th
e
 
bes
t 
one
 
C (i-1,j-1)
 
C
 (i-1,j)
 
C
 (i,j-1)
 
Dynamic
 
Programming msa:
General
 
Case
 
F
or 
k 
sequence
s
 o
f 
lengt
h 
n,
 
dynami
c 
programming algorith
m
doe
s 
(2
k
-1) n
k
 
operations
E
xample:
 
6 sequences
 
of
 length 100 require
 
6
.4X1
0
13 
calculations
S
pac
e for table
 i
s
 
n
k
 
Conclusion
: 
dynami
c 
programmin
g
 
approac
h for 
alignmen
t
betwee
n 
tw
o
 sequence
s
 i
s 
easil
y 
extended 
to 
k 
sequence
s
 bu
t 
i
t
 i
s
impractica
l
 
du
e
 t
o
 exponential runnin
g
 time
 
Greedy
 
Approach
:
 
Example
 
Conside
r 
thes
e
 
4
 sequences
 
 
s1 
 
GATTCA
s2
 
 
GTCTGA
s3
 
 
GATATT
s4
 
 
GTCAGC
 
T
her
e 
ar
e
 
6
 possibl
e 
alignments
 
s2
 
GTC
T
G
A
s4
 
GTC
A
G
C
 
(scor
e
 
=
 
2)
 
s1
 
G
A
T
-
T
C
A
s2
 
G
-
T
C
T
G
A
 
(scor
e
 
=
 1)
 
s1
 
GAT
-
T
C
A
s3
 
GAT
A
T
-
T
 
(score
 
s3
 
G
A
T
-
A
TT
=
 1
)
 
s4
 
G
-
T
C
A
GC
 
s
1
 
G
A
T
T
C
A
--
s
4
 
G
T-CA
GC(scor
e
 
=
 0)
 
s2
 
G
-
T
C
T
GA
s3
 
G
A
T
A
T
-T
 
(score
 
=
 -1)
 
(score
 
=
 -1)
 
Greedy
 
Approach
:
 
Example
 
s
2
 
an
d
 s
4
 
ar
e 
closest
;
 combine:
 
s2
 
GTC
T
G
A
s4
 
GTC
A
G
C
 
s
 
GT
C
t/
a
G
a/
c
 
2,4
 
(profile)
 
s
1
s
3
s
2,4
 
GATTCA
GATATT
GTC
t/
a
G
a/c
 
ne
w 
se
t
 
o
f
 3
 sequences:
 
Greedy
 
Approach
:
 
Example
 
Progressive Method
 
Two major steps – Guide Tree build up and
Multiple Pairwise Alignment
Steps
 
- Take each pair, align
 
- Generate consensus of that
 
   alignment
 
- Align new sequence with the
 
   consensus of the previous one
 
- Go back, Until all sequences are
 
   finished
Example
 
- Clustal 
ω
 
- MAFFT
 
- KALIGN
 
- T-COFFEE
 
Iterative Method
 
Works similarly to progressive
methods
Repeatedly realign the initial
sequences as well as add new
sequences to the growing MSA
Example
 
- DIALIGN
 
- MUSCLE
 
- POA
Slide Note
Embed
Share

Multiple Sequence Alignment (MSA) involves aligning three or more biological sequences to reveal evolutionary relationships and subtle similarities. Various methods like Dynamic, Greedy, Progressive, and Iterative approaches are used to overcome challenges in MSA. The motivation behind MSA includes finding similar functional parts in genes, predicting protein structures, genome assembly, and phylogenetic analysis initiation. MSA is crucial for identifying weakly conserved regions and automating genomic fragment reconstruction.

  • Multiple Sequence Alignment
  • MSA Methods
  • DNA
  • Protein
  • Evolution

Uploaded on Sep 14, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Mulitiple Sequence Alignment Lecture 6 Department of CSE, DIU

  2. CONTENTS 1. Multiple Sequence Alignment (MSA) -Why Multiple Sequence Alignment -Multiple Sequence Alignment: Motivation -MSA Challenges 2. Multiple Sequence Alignment Methods -Dynamic Method: Impractical -Greedy Method -Progressive Method -Iterative Method

  3. 1. Multiple Sequence Alignment (MSA) Why Multiple sequence, Motivation, Challanges

  4. Multiple Sequence Alignment V T I S C T G S S S N I G V T LT C T G S S S N I G V T LS C S S S G F I F S V T LT C T V S G T S F D V T I T C V V S D V S H E V T LV C L I S D F Y P G V T LV C L I S D F Y P G V T LV C L VS D Y F P E A multiple sequence alignment (MSA) is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. In many cases, the input set of query sequences are assumed to have an evolutionary relationship by which they share a linkage and are descended from a common ancestor.

  5. Why Multiple Sequence alignment Up until now we have only tried to align two sequences. What about more than two? And what for? A faint similarity between two sequences becomes significant if present in many Multiple alignments can reveal subtle similarities that pairwise alignments do not reveal V T I S C T G S S S N I G V T LT C T G S S S N I G V T LS C S S S G F I F S V T LT C T V S G T S F D V T I T C V V S D V S H E V T LV C L I S D F Y P G V T LV C L I S D F Y P G V T LV C L VS D Y F P E

  6. MultipleSequence Alignment:Motivation Correspondence. Find out which parts do thesamething Similar genes are conserved across widely divergentspecies, oftenperforming similar functions Structureprediction Use knowledge of structure of one or more members of a protein MSA to predict structure of other members Structureis more conserved than sequence Create profiles forprotein families Allow us to search for other members of the family Genome assembly:Automated reconstruction of contig maps ofgenomic fragments suchas ESTs msa is thestartingpoint forphylogenetic analysis msaoften allows todetect weakly conserved regionswhich pairwise alignment can t

  7. MSA Challanges Computationally Expensive If msa includes matches, mismatches and gaps and also accounts the degree of variation then global msa can be applied to onlya few sequences Difficult to score Multiple comparison necessary in each column of the msa for a cumulative score Placement of gaps and scoring of substitution is more difficult Difficulty increases with diversity Relatively easy for a set of closely related sequences Identifying the correct ancestry relationships for a set of distantly related sequences is more challenging Even difficult if some members are more alike compared to others

  8. 2. Multiple Sequence Alignment Methods Dynamic Programming, Greedy, Progressive, Iterative

  9. GlobalMSA:DynamicProgramming The two-sequence alignment algorithm (Needleman- Wunsch) can be generalized to any number of sequences. E.g., for three sequences X, Y, W define C[i,j,k] = score of optimum alignment among X[1..i], Y[1..j], W[1..k] As for two sequences, divide possible alignments into different classes, depending on how they end. Devise recurrence relations for C[i,j,k] C[i,j,k] is the maximum out of all possibilities

  10. MSA for 3 sequences: alignment can end in 7 ways Xi Yj Wk X1 . Y1 . W1 . . . Xi-1 Yj-1 Wk-1 . Xi Yj Wk . . . - Yj Wk Xi - Wk Xi - - Xi Yj - - Yj - - - Wk

  11. Aligning Three Sequences V V W W 2-D edit graph Same strategy as aligning two sequences Use a 3-D Manhattan Cube , with each axis representing a sequence to align X 3-D edit graph

  12. Dynamic programming for 3 sequences Each alignment is a path through the dynamic programming matrix V S N S S N A A S A N S V S N S Start

  13. 2-D cell versus 3-D Alignment Cell C (i-1,j) C (i-1,j-1) C(i-1,j-1,k-1) C(i-1,j,k-1) C (i-1,j,k) C(i-1,j-1,k) C (i,j-1) In 2-D, 3 edges in each unit square C(i,j-1,k-1) C(i,j,k-1) In 3-D, 7 edges in each unit cube C(i,j,k) C(i,j-1,k) Enumerate all possibilities and choose the best one

  14. Dynamic Programming msa: General Case For k sequencesof length n, dynamic programming algorithm does (2k-1) nkoperations Example: 6 sequences oflength 100 require 6.4X1013 calculations Space for table isnk Conclusion: dynamic programmingapproach for alignment between twosequencesis easily extended to k sequencesbut itis impracticalduetoexponential runningtime

  15. Greedy Approach: Example Consider these 4 sequences s1 GATTCA s2 GTCTGA s3 GATATT s4 GTCAGC

  16. Greedy Approach: Example There are 6 possible alignments s2 GTC s4 GTC s1 G GAT TTC CA A-- s4 G G T T- -CA GTCTG GA GTCAG GC (score = 2) CAGC(score = 0) s1 G GAT T-T TCA A s2 G G-T TCT TGA A (score = 1) s2 G G-T TCT TGA s3 G GAT TAT T-T (score = -1) s1 GAT s3 GAT s3 G GAT T-A ATT s4 G G-T TCA AGC GAT-T TCA A GATAT T-T T (score = 1) (score = -1)

  17. GreedyApproach:Example s2 and s4 are closest; combine: s2 GTC GTCTG GA s4 GTC GTCAG GC (profile) s 2,4 GTCt/aGa/c new set of 3 sequences: s1 s3 s2,4 GATTCA GATATT GTCt/aGa/c

  18. Progressive Method Two major steps Guide Tree build up and Multiple Pairwise Alignment Steps -Take each pair, align -Generate consensus of that alignment -Align new sequence with the consensus of the previous one -Go back, Until all sequences are finished Example -Clustal -MAFFT -KALIGN -T-COFFEE

  19. Iterative Method Works similarly to progressive methods Repeatedly realign the initial sequences as well as add new sequences to the growing MSA Example -DIALIGN -MUSCLE -POA

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#