Multiple Sequence Alignment with Hidden Markov Models

 
Making A Multiple Sequence Alignment with Hidden Markov Models
Oct 26th, 2023
Chengze Shen
Outline

Multiple Sequence Alignment
a. Background
b. Profile Hidden Markov Models
Getting An Alignment - Usage of Profile HMMs
a. Single Profile HMM
b. UPP
c. WITCH
Beyond profile HMMs - Usage of Progressive Alignment
a. EMMA
 
Background - Multiple Sequence Alignment
 
The input to Multiple Sequence Alignment (MSA) is a set of unaligned sequences (e.g., strings that consist of letters of nucleotides or amino acids).
The output is an alignment of the sequences, adding gaps if necessary.
 
Background - Multiple Sequence Alignment
 
To measure the accuracy of an MSA (compared to a reference):
Sum-of-pairs false negatives (SPFN) → fraction of missed true homologies
Sum-of-pairs false positives (SPFP) → fraction of falsely recovered homologies
 
Background - Multiple Sequence Alignment
 
MSA is a crucial step for many downstream biological analyses, such as
a. Phylogeny estimation, detecting and quantifying selection, etc.
b. Taxonomic identification/Abundance profiling
Classic methods (e.g., MAFFT, MUSCLE, and Clustal-Omega).
More recent methods (e.g., SATe, SATe-II, PASTA, and MAGUS):
a. Divide-and-conquer
b. Align sub-problems with an accurate method (e.g., MAFFT-linsi)
c. Merge sub-problems to form the final alignment
d. Scalable to hundreds of thousands of sequences
e. Accurate when rates of evolution are high
 
Background - Multiple Sequence Alignment
 
However, only some methods try to deal with MSA with sequence length heterogeneity.
Often appearing in biological datasets
Also, if aligning reads/fragmentary sequences
Additionally, many existing methods do not scale to aligning many sequences.
E.g., aligning one million sequences
General attempt to solve both limitations (two-stage approach):
Align some sequences with an accurate method
Add in the remaining sequences

Figure 1 from WITCH (Shen et al. 2023). Histograms of sequence lengths; the x-axis refers to the number of nucleotides (bp). 1000M1 is a simulated dataset; the remaining datasets are biological and from the Comparative Ribosome Website (CRW).
 
Background - Profile Hidden Markov Models
 
Alignment as a profile HMM (Figure credit: EMBL-EBI)

Given a profile HMM H and an unaligned sequence q, we can:
Find the most probable path of q through H (Viterbi algorithm)
Find the probability that H generates q (Forward algorithm)
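The Viterbi step can be sketched for a plain HMM. Note this is a simplification: real profile HMMs also contain silent delete states, which tools like HMMER handle with a more involved recursion. Below is a minimal log-space Viterbi with a hypothetical two-match-state toy model (the states, transitions, and emissions are illustrative, not from the slides):

```python
import math

LOG0 = -math.inf  # log(0)

def viterbi(obs, states, log_start, log_trans, log_emit):
    """Most probable state path for obs under an HMM (log-space Viterbi).
    All parameters are dicts of natural-log probabilities."""
    V = [{s: log_start[s] + log_emit[s].get(obs[0], LOG0) for s in states}]
    back = []
    for t in range(1, len(obs)):
        col, ptr = {}, {}
        for s in states:
            # Best predecessor state for s at time t.
            prev = max(states, key=lambda p: V[-1][p] + log_trans[p].get(s, LOG0))
            col[s] = (V[-1][prev] + log_trans[prev].get(s, LOG0)
                      + log_emit[s].get(obs[t], LOG0))
            ptr[s] = prev
        V.append(col)
        back.append(ptr)
    path = [max(states, key=lambda s: V[-1][s])]  # traceback from best final state
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))

# Toy model: two match states, M1 prefers 'A', M2 prefers 'T';
# the only allowed move is M1 -> M2.
log = math.log
states = ["M1", "M2"]
log_start = {"M1": 0.0, "M2": LOG0}
log_trans = {"M1": {"M2": 0.0}, "M2": {"M2": 0.0}}
log_emit = {"M1": {"A": log(0.9), "T": log(0.1)},
            "M2": {"A": log(0.1), "T": log(0.9)}}
path = viterbi("AT", states, log_start, log_trans, log_emit)  # ['M1', 'M2']
```

The Forward algorithm has the same recursion shape with `max` replaced by log-sum-exp, yielding the probability that H generates q rather than the single best path.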
 
Getting An Alignment - Use A Single Profile HMM
 
(Recap) Multiple sequence alignment in two stages:
a. Select a subset of sequences to be aligned (any method)
b. Add in the remaining sequences
Simplest idea:
a. Use a single profile HMM to represent the subset alignment
b. Add in the remaining sequences to the subset alignment by computing their most probable paths through the profile HMM

Subset alignment:
AAAAATTTTTTAAAAA
AAAAA----TTAAAAA
AACAT----TTAAAA-

New sequence: CCGGAAAAATTTTT
Its pHMM state path: iiiiidmmmmmmmmdddddddi

Resulting alignment:
-----AAAAATTTTTTAAAAA-
-----AAAAA----TTAAAAA-
-----AACAT----TTAAAA--
ccgga-AAAATTTT-------t
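Rendering a state path as one alignment row can be sketched as follows. The helper `path_to_row` is hypothetical, following the deck's convention that match ('m') and delete ('d') states consume backbone columns while insertions ('i') are shown in lowercase:

```python
def path_to_row(query, path):
    """Render a query's pHMM state path as one alignment row.
    'm' consumes a query character into the next match column,
    'd' emits a gap for a skipped match column, and
    'i' emits the query character in lowercase (an insertion
    relative to the backbone columns)."""
    row, qi = [], 0
    for state in path:
        if state == "m":
            row.append(query[qi].upper())
            qi += 1
        elif state == "d":
            row.append("-")
        elif state == "i":
            row.append(query[qi].lower())
            qi += 1
        else:
            raise ValueError(f"unknown state {state!r}")
    return "".join(row)

# The slide's example: the new sequence and its most probable path.
row = path_to_row("CCGGAAAAATTTTT", "iiiiidmmmmmmmmdddddddi")
# row == "ccgga-AAAATTTT-------t"
```

The backbone rows then need all-gap columns wherever the query row has insertions, which is why the final alignment pads them with leading and trailing gaps.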
 
Getting An Alignment - UPP
 
A potential issue with using a single pHMM:
If an alignment has highly varied sequences (high rates of evolution), then the produced pHMM may poorly represent the alignment.
Solution (UPP, Nguyen et al. 2015):
Instead of creating a single pHMM, break the alignment into smaller subsets
Create a pHMM for each of the smaller subsets
For each new sequence q: find the best-fitting pHMM for q and use it to align q
 
Getting An Alignment - UPP
 
1. Separate unaligned sequences into two sets
a. A subset of full-length sequences as the backbone
b. Remaining sequences as queries
2. Estimate the backbone alignment and tree
3. Construct an ensemble of pHMMs (eHMM) to model the backbone alignment
4. Assign each query sequence to a single best pHMM according to bit scores
a. Then, align queries to their corresponding pHMMs
5. Merge query alignments to the backbone alignment with transitivity

Scoring metric: HMMER (Finn et al. 2011) bit score (the fitness between an HMM H and a query sequence q).
Core idea: representing the backbone with multiple pHMMs.

Figure 2 from UPP.
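Step 5's transitivity merge can be illustrated with a toy sketch (not UPP's implementation): because an extended query row identifies which of its characters sit in backbone match columns, the backbone rows only need all-gap columns wherever the query has insertions. Lowercase marks insertion characters, as in the earlier single-pHMM example:

```python
def merge_by_transitivity(backbone_rows, query_row):
    """Toy transitivity merge of one extended query row into a backbone
    alignment: lowercase characters in query_row are insertions, so every
    backbone row gets a gap there; every other character (uppercase or '-')
    lines up 1:1 with the next backbone column."""
    merged = ["" for _ in backbone_rows]
    col = 0
    for ch in query_row:
        if ch.islower():                      # insertion column
            for i in range(len(merged)):
                merged[i] += "-"
        else:                                 # match/delete column
            for i, row in enumerate(backbone_rows):
                merged[i] += row[col]
            col += 1
    return merged + [query_row]

# The subset alignment and extended query row from the single-pHMM slide:
backbone = ["AAAAATTTTTTAAAAA",
            "AAAAA----TTAAAAA",
            "AACAT----TTAAAA-"]
merged = merge_by_transitivity(backbone, "ccgga-AAAATTTT-------t")
# merged[0] == "-----AAAAATTTTTTAAAAA-"
```

Merging several query rows at once works the same way, with one all-gap column inserted per insertion event across all rows.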
 
Getting An Alignment - UPP
 
(Figure: SPFN of UPP compared to the single-pHMM approach.)
 
Getting An Alignment - UPP
 
(Figure: results averaged over (1) 12.5%, (2) 25%, and (3) 50% fragmentation.)
 
Getting An Alignment - WITCH
 
WITCH (Shen et al. 2022b) improves upon UPP:
uses a different criterion to assign queries to HMMs (weighted bit scores)
uses multiple HMMs instead of one HMM to align each query sequence

New scoring metric: weighted bit scores.
 
Improving the UPP Framework - WITCH
 
1. UPP selects the best-scoring HMM to align a sequence by bit scores.
> Concern - bit scores do not correctly reflect probabilities of query generation.
> Improvement idea - replace bit scores with a more statistically rigorous calculation.

2. UPP utilizes one best HMM to align a sequence.
> Concern - the eHMM is underutilized, and there might be additional alignment information UPP ignores.
> Improvement idea - align a sequence with multiple high-ranked HMMs and find a consensus alignment.

(Figure: a query q generated by Hx, which is selected as the best HMM among H1, H2, ..., Hx.)
 
pHMM Weighting

The bit score between an HMM Hi and a query q compares how likely Hi generates q against a null/random model:

$$b_i(q) = \log_2 \frac{P(q \mid H_i)}{P(q \mid H_\text{null})}$$

We define the weight $w_i(q)$ as the probability that HMM $H_i$ generates sequence $q$, and let $s_i$ be the number of sequences in the alignment that generates $H_i$. Notice the impact of the number of sequences of the alignment that the HMM is built upon.

HMM Weighting Derivation

Taking the prior $P(H_i) \propto s_i$, Bayes' rule gives

$$w_i(q) = P(H_i \mid q) = \frac{P(q \mid H_i)\,P(H_i)}{\sum_j P(q \mid H_j)\,P(H_j)} = \frac{s_i\,2^{b_i(q)}}{\sum_j s_j\,2^{b_j(q)}}$$

where the null-model term cancels between numerator and denominator.
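Assuming a prior proportional to the number of source sequences, as described above, the weights can be computed from bit scores as in this sketch (`hmm_weights` is a hypothetical helper; shifting by the maximum bit score keeps the exponentials from overflowing):

```python
def hmm_weights(bit_scores, sizes):
    """w_i proportional to s_i * 2^{b_i}, normalized to sum to 1.
    bit_scores: bit score b_i of each HMM against the query.
    sizes: number of sequences s_i in each HMM's source alignment."""
    b_max = max(bit_scores)
    # Shift exponents by b_max so 2**(b - b_max) never overflows.
    unnorm = [s * 2.0 ** (b - b_max) for b, s in zip(bit_scores, sizes)]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# An HMM built from 4x as many sequences overcomes a 2-bit deficit:
w = hmm_weights([100.0, 98.0, 60.0], [50, 200, 10])
# w[0] and w[1] each come out near 0.5; w[2] is negligible
```

The example shows why the prior matters: ranking by raw bit score alone would put all the mass on the first HMM.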
 
Finding A Consensus

How to use the weighted pHMMs to align a sequence q?
We can find a consensus alignment of q by combining alignments from multiple pHMMs. For a set of pHMMs {H1, H2, ..., Hn}:

Alignment(q) = sum over i of Alignment(q|Hi) * w(Hi, q)

*In reality, only a few pHMMs may have impactful weights.
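A toy version of the consensus idea can show how the weights enter. This is not WITCH's actual merging (WITCH uses graph-based methods such as the Graph Clustering Merger or Maximum Weight Trace, and a naive per-position argmax as below can in principle break column ordering); `consensus_columns` and its position-to-column encoding are illustrative assumptions:

```python
from collections import defaultdict

def consensus_columns(mappings, weights):
    """Each mapping sends query positions to backbone column indices
    (None = insertion) under one pHMM; pick, per position, the column
    with the largest total supporting weight."""
    support = defaultdict(float)
    for mapping, w in zip(mappings, weights):
        for pos, col in mapping.items():
            support[(pos, col)] += w
    best = {}
    for (pos, col), w in support.items():
        if pos not in best or w > support[(pos, best[pos])]:
            best[pos] = col
    return best

# Two pHMMs disagree on where query position 1 goes:
a1 = {0: 5, 1: 6, 2: 7}      # alignment under H1, weight 0.7
a2 = {0: 5, 1: None, 2: 7}   # alignment under H2, weight 0.3
consensus = consensus_columns([a1, a2], [0.7, 0.3])
# consensus == {0: 5, 1: 6, 2: 7}
```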
 
An Example of Consensus - Using 2 pHMMs
 
Figure 3 from WITCH. An example of how WITCH combines multiple query-HMM alignments to form a consensus.

The alignments can be merged by:
1. Graph Clustering Merger (Smirnov and Warnow 2021).
2. Maximum Weight Trace (Smirnov et al. 2022, Liu and Warnow 2023).
 
WITCH results - Datasets
 
(Table: testing datasets. * p-distance is the normalized Hamming distance.)
 
WITCH results - High Fragmentary Datasets
 
High fragmentation: 50% of sequences are made into fragments of ~25% of their original lengths.
Error: average of SPFN and SPFP
MAGUS+UPP: UPP when the backbone is aligned with MAGUS

WITCH consistently improves on UPP on all datasets in terms of alignment error
WITCH is the most accurate overall
WITCH also leads to more accurate downstream tree estimation

Partial Figure 5 from WITCH. Alignment error and tree estimation false negative rates of methods.
 
WITCH results - High Fragmentary Datasets
 
High fragmentation: 50% of sequences are made into fragments of ~25% of their original lengths.
Error: average of SPFN and SPFP

Same story for the other model conditions
a. But with lower rates of evolution (left to right), differences between methods are smaller

Figure 5 from WITCH. Alignment error (top, average of SPFN and SPFP) and tree false negative rate (bottom, i.e., missing bipartitions) of WITCH and other methods on all sequences of simulated datasets with high fragmentation.
 
WITCH results - High Fragmentary Datasets
 
Zoomed-in comparison between WITCH and UPP. "MAGUS" means the backbone alignment method for both UPP and WITCH. (An annotation in the figure marks which datasets are more difficult to align.)
 
WITCH results - High Fragmentary Datasets
 
Partial table from Table S7 of WITCH. SPFN/SPFP of WITCH and MAGUS+UPP on high-fragmentary simulated datasets, induced on just the fragmentary sequences.

The main advantage of WITCH over MAGUS+UPP comes from noticeably lower SPFN, at the cost of slightly higher SPFP.
 
Conclusions - WITCH
 
WITCH builds upon the UPP framework and introduces two algorithmic innovations:
a. HMM weighting
b. query consensus alignment with multiple weighted HMMs
WITCH has better alignment accuracy than UPP
WITCH alignments yield more accurate maximum-likelihood tree estimation
 
Beyond profile HMM - Its Limitation
 
pHMM-based methods
Advantage: fast, and scales linearly with the number of queries
Disadvantage: cannot find homologous pairs between queries that don't go through pHMM match states
E.g., insertion states from two different aligned queries will never be aligned together
 
Beyond profile HMMs - Other Ways to Add Sequences
 
Inputs: an existing alignment and a set of unaligned query sequences
Output: an alignment on all sequences

Method | pHMM? | Pros | Cons
MAFFT --add (progressive alignment) | No | Very fast, does not use HMMs | Less accurate and memory-hungry with a lot of sequences
MAFFT-linsi --add (progressive alignment) | No | Accurate, does not use HMMs | Very slow, memory-hungry, and not scalable with more than 1000 sequences
UPP | Yes | Accurate, fast, and scalable to many sequences | Query sequences are added independently of each other
WITCH and WITCH-ng (same as WITCH but faster) | Yes | More accurate than UPP, also scalable to many sequences | Same as UPP
 
Progressive Alignment When Adding New Sequences
 
Credit: Sandra Baldauf

Existing alignment:
A: AATTCG...
B: ATTTCC...
D: AAT-CG...
E: -A--CG...
G: ...
H: ...
K: ...

New unaligned sequences:
C: AATCC...
F: ACGG...
I: TTTCCG..
J: TTCCG...

1. Build a distance matrix to get a guide tree on ALL sequences
2. When merging two neighbor nodes:
a. If a new sequence is involved in at least one node and it is not aligned to the existing alignment yet, perform alignment
b. Otherwise, no alignment is needed (use the existing alignment)

(Guide-tree annotations in the figure mark each merge as "known", "aligning C", "aligning F", "aligning I+J", or "aligning IJ".)
 
Scaling Issue with MAFFT-linsi
 
MAFFT-linsi --add is more accurate than the default MAFFT --add, but slower and less scalable.
Backbone: 1000-sequence alignment (dataset is INDELible 5000M2, averaged over 9 replicates).
(Figure annotation: MAFFT-linsi failures are due to OOM.)
 
EMMA pipeline
 
1. Build a tree on the existing alignment (subset alignment).
2. Decompose the tree into subsets as in PASTA.
Subsets have induced sub-alignments.
3. Select subsets with L to U sequences.
4. Construct HMMs on alignments of these selected subsets.

Modified Figure 1 from PASTA.
 
EMMA pipeline
 
1. Build a tree on the existing alignment (subset alignment).
2. Decompose the tree into subsets as in PASTA.
Subsets have induced sub-alignments.
3. Select subsets with L to U sequences.
4. Construct HMMs on alignments of these selected subsets.
5. Assign query sequences to subsets based on adjusted HMM bit scores.
6. Align queries to alignments of the assigned subsets with MAFFT-linsi --add (progressive alignment instead of pHMMs).
7. Merge all extended sub-alignments using transitivity.

Modified Figure 1 from PASTA.
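The L-to-U size rule of step 3 can be sketched as follows. This is a simplification under stated assumptions: EMMA decomposes along the tree (PASTA-style centroid-edge decomposition), whereas plain recursive halving stands in here, and `select_subsets` is a hypothetical helper:

```python
def select_subsets(seqs, L, U):
    """Recursively split a sequence set, keeping every subset whose size
    falls within [L, U]. (EMMA splits along the guide tree; plain halving
    is a stand-in for illustration.)"""
    selected = []

    def recurse(s):
        if L <= len(s) <= U:
            selected.append(s)
        elif len(s) > U:
            mid = len(s) // 2
            recurse(s[:mid])
            recurse(s[mid:])
        # subsets smaller than L are dropped

    recurse(list(seqs))
    return selected

subsets = select_subsets(range(10), 2, 3)
# splits 10 -> 5 + 5 -> (2 + 3) + (2 + 3); all four pieces fall in [2, 3]
```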
 
EMMA - experiment results
 
Backbone selection: 10% of all sequences, up to 1000 randomly selected sequences; the remaining sequences are queries.
Biological datasets; "X" means the method failed to run.
Dataset sizes: 320-807, 6,323, 7,350, 27,643, 96,773, and 186,802 sequences.

All methods generally have higher error on datasets with higher rates of evolution.
EMMA is the most accurate method.
 
EMMA - experiment results
 
Backbone selection: 10% of all sequences, up to 1000 randomly selected sequences; the remaining sequences are queries.
Simulated datasets; "X" means the method failed to run.
Dataset sizes: 1000, 1000, 1000, 1000, 5000, 5000, and 10,000 sequences.

Similar story to the biological data.
Bigger accuracy advantage for EMMA on datasets with high rates of evolution.
 
Conclusions - EMMA
 
EMMA successfully scales MAFFT-linsi --add to run on much larger datasets, including those with up to ~186,000 sequences (Res).
It usually has the best accuracy compared to WITCH-ng and even just MAFFT-linsi --add (when it can run).
Compared to HMM-based methods, EMMA can find homologies between added sequences.
