Profile HMMs

Profile HMMs

Tandy Warnow

BioE/CS 598AGB

Profile Hidden Markov Models

•

Basic tool in sequence analysis

•

Look more complicated than they really are

•

Used to model a family of sequences

•

Can be built from a multiple sequence

alignment

•

Algorithms using profile HMMs are based on

dynamic programming (much like Needleman-

Wunsch)

Profile

•

Given a gap-less multiple sequence alignment,

we can build a profile describing what we see

•

S1 =  A C T A G

•

S2 =  A C A A G

•

S3 =  A T T  T G

•

S4 =  G T T  C G

             1       2          3          4       5

   A    0.75    0.0     0.25    0.50   0.0

   C    0.00     0.5    0.00    0.25   0.0

   T    0.00     0.5    0.75    0.25    0.0

   G    0.25    0.0     0.00    0.00   1.0

Using a profile

•

             1       2          3          4       5

•

   A    0.75    0.0     0.25    0.50   0.0

•

   C    0.00     0.5    0.00    0.25   0.0

•

   T    0.00     0.5    0.75    0.25    0.0

•

   G    0.25    0.0     0.00    0.00   1.0

End

End

0.75

0.0

0.0

0.25

0.0

0.5

0.5

0.0

0.25

0.0

0.75

0.0

0.50

0.25

0.25

0.0

0.0

0.0

0.0

1.0













Start

1.0

1.0

1.0

1.0

1.0

1.0

The profile yields a probability distribution of sequences – here, all of the

same length.

Adding in insertions

•

The profile shown in the previous slide only

had *match* states (indicated by rectangles).

It doesn’t allow any insertions or deletions.

•

To model indels, we just have to add

additional states to the graphical model.

–

Insertion states: Diamonds (have non-zero

emission probabilities)

–

Deletion states: Circles (nothing emitted)

From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

Building Profile HMMs

•

Profile HMMs can be built from a given

multiple sequence alignment – this is not too

difficult.

•

Profile HMMs can also be built from unaligned

sequences. This is a bit complicated.

Using Profile HMMs

•

Given a Profile HMM computed for a multiple

sequence alignment, you can use it to

–

Recognize related sequences

–

Add related sequences into the multiple sequence

alignment

•

See PFAM,

http://pfam.xfam.org

, for how

profile HMMs are used to represent groups of

functionally and structurally related proteins.

Using profile HMMs to align sequences

•

Given an MSA for a set

 of representative

sequences for a gene and set

of additional

sequences (query sequences)

–

You build a profile HMM for the MSA on S.

–

For each

s in X

 you find for the gene, you find the

path

through the profile HMM that is most likely to

generate

–

The

path

 specifies how to add the sequence into the

MSA (only the match states count).

–

Transitivity gives you the final MSA after you add in all

the other sequences.

From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

Other uses of profile HMMs

•

Given two profile HMMs (H1 and H2), and a

sequence s, you can determine which one is more

likely to generate s (again, using dynamic

programming).

•

Note that computing the probability that a profile

HMM generates a sequence requires calculating

the probability for *every* path and adding up

the probabilities. This can be calculated in

polynomial time, using dynamic programming.

Applications of profile HMMs

•

Recognizing membership in protein families

(see PFAM http://pfam.xfam.org)

•

Progressive multiple sequence alignment

methods, such as SATCHMO (Edgar and

Sjolander, Bioinformatics 2003)

•

Phylogenetic placement (e.g., SEPP, Mirarab et

al., PSB 2012)

•

Taxonomic identification of metagenomic data

(e.g., TIPP, Nguyen et al., Bioinformatics 2014)

Another application: UPP

•

UPP (Nguyen et al., RECOMB 2015) is a

method for ultra-large multiple sequence

alignment:

–

Up to 1,000,000 sequences

–

Robust to fragmentary sequences

•

Nam-phuong Nguyen (IGB postdoctoral

fellow) will present this method on Tuesday.

HMMER

•

http://hmmer.janelia.org

•

One of the most popular collection of tools to

perform analyses based on profile HMMs.

•

HMMER web server: interactive sequence

similarity searching, NAR 2011,

http://nar.oxfordjournals.org/content/39/sup

pl_2/W29

Supertree Estimation

•

Purposes:

–

Divide-and-conquer tree estimation

–

Combining analyses performed by other research

groups

–

Tree of Life

Supertree Estimation

Challenges:

•

Tree compatibility is NP-complete (therefore, even if

subtrees are correct, supertree estimation is hard)

•

Estimated subtrees have error

Advantages:

•

Estimating individual gene trees can be

computationally feasible (compared to the combined

analysis of many genes)

•

Can use different types of data for each gene

Many Supertree Methods

•

MRP

•

weighted MRP

•

MRF

•

MRD

•

Robinson-Foulds

Supertrees

•

Min-Cut

•

Modified Min-Cut

•

Semi-strict Supertree

•

QMC

•

Q-imputation

•

SDM

•

PhySIC

•

Majority-Rule Supertrees

•

Maximum Likelihood

Supertrees

•

and many more ...

MRP and MRL

•

MRP (Matrix Representation with Parsimony)

•

MRL (Matrix Representation with Likelihood)

–

The input set of source trees is represented by a

binary matrix with ?s for missing data

–

Each edge in each source tree gives you one

column in the matrix (based on the bipartition it

defines on its leafset)

–

Then you analyze the matrix using parsimony or

CFN maximum likelihood

Single gene vs. multi-gene analyses

•

Most methods analyze

single

 genes (or other

genomic region).  These produce estimated

“

gene trees

”

•

But species trees are estimated using

multiple

genes.

Not all genes present in all species

Two competing approaches

Combined

 Analysis

Species

FN rate of MRP vs.

combined analysis

Scaffold Density (%)

SuperFine-boosting:

improves accuracy of

MRP

Scaffold Density (%)

(Swenson et al., Syst. Biol. 2012)

SuperFine

•

First, construct a supertree with low false

positives

The Strict

Consensus

•

Then, refine the tree to reduce false

negatives by resolving each polytomy using a

“

base

”

 supertree method (e.g., MRP)

Quartet Max Cut

Obtaining a supertree with low FP

The Strict Consensus Merger (SCM)

SCM of two trees

Computes the strict consensus on the common leaf

set

Then superimposes the two trees, contracting more

edges in the presence of

“

collisions

”

Strict Consensus Merger (SCM)

Theoretical results for SCM

•

SCM can be computed in polynomial time

•

For certain types of inputs, the SCM method

solves the NP-hard

“

Tree Compatibility

”

problem

•

All splits in the SCM

“

appear

”

 in at least one

source tree (and are not contradicted by any

source tree)

Performance of SCM

•

Low false positive (FP) rate

(Estimated supertree has few false edges)

•

High false negative (FN) rate

(Estimated supertree is missing many true edges)

Part II of SuperFine

•

Refine the tree to reduce false negatives

by resolving each polytomy using a base

supertree method (e.g., MRP)

Part 1 of SuperFine

Part 2 of SuperFine

Theorem

Given

–

a set of source trees,

–

SCM tree T,

–

and a polytomy in T,

after relabelling and reducing, each source tree

has

at most one leaf with each label

Step 2: Apply MRP to the collection of reduced

source trees

MRP

Replace polytomy using tree from MRP

Resolving a single polytomy,

v, using

MRP

•

Step 1: Reduce each source tree to a tree on

leafset, {1,2,...,

} where

=degree(

•

Step 2: Apply MRP to the collection of reduced

source trees, to produce a tree t on {1,2,...,

•

Step 3: Replace the star tree at

 by tree

SuperFine-boosting:

improves accuracy of

MRP

Scaffold Density (%)

(Swenson et al., Syst. Biol. 2012)

SuperFine is also much faster

MRP

  8-12 sec.

SuperFine

2-3 sec.

Scaffold Density (%)

Scaffold Density (%)

Scaffold Density (%)

Uses of Supertree Methods

•

DACTAL: divide-and-conquer trees almost without

alignments (Nelesen et al, 2012)

•

DCM1-boosting

In these methods, the dataset is divided into subsets

(using chordal graph theory), and then trees on the

subsets are combined (using supertree methods).

Proofs about the algorithm guarantees are established

using chordal graph theory.

Chordal graph algorithms yield phylogeny

estimation from

polynomial length

 sequences

•

Theorem (Warnow et

al., SODA 2001):

DCM1-NJ correct with

high probability given

sequences of length

O(ln n e

O(ln n)

•

Simulation study from

Nakhleh et al. ISMB

NJ

DCM1-NJ

No. Taxa

0.2

0.4

0.6

0.8

Error Rate

DACTAL: divide-and-conquer trees without alignment

(submitted)

New supertree method:

SuperFine

Existing Method:

RAxML(MAFFT)

pRecDCM3

BLAST-

based

Overlapping

subsets

A tree for

each subset

Unaligned

Sequences

A tree for the

entire dataset

é

CRW: Comparative RNA database, structural

alignments

3 datasets with 6,323 to 27,643 sequences

Reference trees: 75% RAxML bootstrap trees

DACTAL (shown in red) run for 5 iterations

starting from FT(Part)

SATé-1 fails on the largest dataset

SATé-2 runs but is not more accurate than

DACTAL, and takes longer

More divide-and-conquer

•

Recall that SATe and PASTA use divide-and-

conquer (and also iteration) to improve

alignment estimation.

•

Alignments on different subsets are merged

together using techniques like OPAL and Muscle.

•

However, alignments can also be merged

together using HMM-HMM alignment.

•

You should think of your own algorithmic designs

for improving scalability and accuracy, whether

for MSA or for tree estimation!

Slide Note

Embed Share

Download

Profile Hidden Markov Models (HMMs) are essential tools in sequence analysis. These models capture the complexity of sequence families and are based on dynamic programming algorithms. By building profiles from multiple sequence alignments, Profile HMMs provide a probability distribution of sequences, enabling the modeling of insertions and deletions. Learn how to recognize related sequences and expand alignments using Profile HMMs in bioinformatics applications.

swinger_j Follow

Uploaded on Feb 17, 2025 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

You are allowed to download the files provided on this website for personal or commercial use, subject to the condition that they are used lawfully. All files are the property of their respective owners.

Download Presentation

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author.

E N D

Presentation Transcript

Profile HMMs Tandy Warnow BioE/CS 598AGB

Profile Hidden Markov Models Basic tool in sequence analysis Look more complicated than they really are Used to model a family of sequences Can be built from a multiple sequence alignment Algorithms using profile HMMs are based on dynamic programming (much like Needleman- Wunsch)

Profile Given a gap-less multiple sequence alignment, we can build a profile describing what we see 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0 S1 = A C T A G S2 = A C A A G S3 = A T T T G S4 = G T T C G

Using a profile 1 2 3 4 5 A 0.75 0.0 0.25 0.50 0.0 C 0.00 0.5 0.00 0.25 0.0 T 0.00 0.5 0.75 0.25 0.0 G 0.25 0.0 0.00 0.00 1.0 0.0 0.5 0.5 0.0 0.75 0.0 0.0 0.25 0.25 0.0 0.75 0.0 0.50 0.25 0.25 0.0 0.0 0.0 0.0 1.0 1.0 1.0 1.0 1.0 1.0 1.0 EndEnd Start The profile yields a probability distribution of sequences here, all of the same length.

Adding in insertions The profile shown in the previous slide only had *match* states (indicated by rectangles). It doesn t allow any insertions or deletions. To model indels, we just have to add additional states to the graphical model. Insertion states: Diamonds (have non-zero emission probabilities) Deletion states: Circles (nothing emitted)

From http://www.cbs.dtu.dk/~kj/bioinfo_assign2.html

From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

Building Profile HMMs Profile HMMs can be built from a given multiple sequence alignment this is not too difficult. Profile HMMs can also be built from unaligned sequences. This is a bit complicated.

Using Profile HMMs Given a Profile HMM computed for a multiple sequence alignment, you can use it to Recognize related sequences Add related sequences into the multiple sequence alignment See PFAM, http://pfam.xfam.org, for how profile HMMs are used to represent groups of functionally and structurally related proteins.

Using profile HMMs to align sequences Given an MSA for a set S of representative sequences for a gene and set X of additional sequences (query sequences) You build a profile HMM for the MSA on S. For each s in X you find for the gene, you find the path through the profile HMM that is most likely to generate s. The path specifies how to add the sequence into the MSA (only the match states count). Transitivity gives you the final MSA after you add in all the other sequences.

From http://codecereal.blogspot.com/2011/07/protein-profile-with-hmm.html

Other uses of profile HMMs Given two profile HMMs (H1 and H2), and a sequence s, you can determine which one is more likely to generate s (again, using dynamic programming). Note that computing the probability that a profile HMM generates a sequence requires calculating the probability for *every* path and adding up the probabilities. This can be calculated in polynomial time, using dynamic programming.

Applications of profile HMMs Recognizing membership in protein families (see PFAM http://pfam.xfam.org) Progressive multiple sequence alignment methods, such as SATCHMO (Edgar and Sjolander, Bioinformatics 2003) Phylogenetic placement (e.g., SEPP, Mirarab et al., PSB 2012) Taxonomic identification of metagenomic data (e.g., TIPP, Nguyen et al., Bioinformatics 2014)

Another application: UPP UPP (Nguyen et al., RECOMB 2015) is a method for ultra-large multiple sequence alignment: Up to 1,000,000 sequences Robust to fragmentary sequences Nam-phuong Nguyen (IGB postdoctoral fellow) will present this method on Tuesday.

HMMER http://hmmer.janelia.org One of the most popular collection of tools to perform analyses based on profile HMMs. HMMER web server: interactive sequence similarity searching, NAR 2011, http://nar.oxfordjournals.org/content/39/sup pl_2/W29

Supertree Estimation Purposes: Divide-and-conquer tree estimation Combining analyses performed by other research groups Tree of Life

Supertree Estimation Challenges: Tree compatibility is NP-complete (therefore, even if subtrees are correct, supertree estimation is hard) Estimated subtrees have error Advantages: Estimating individual gene trees can be computationally feasible (compared to the combined analysis of many genes) Can use different types of data for each gene

Many Supertree Methods Matrix Representation with Parsimony (Most commonly used and most accurate) MRP weighted MRP MRF MRD Robinson-Foulds Supertrees Min-Cut Modified Min-Cut Semi-strict Supertree QMC Q-imputation SDM PhySIC Majority-Rule Supertrees Maximum Likelihood Supertrees and many more ...

MRP and MRL MRP (Matrix Representation with Parsimony) MRL (Matrix Representation with Likelihood) The input set of source trees is represented by a binary matrix with ?s for missing data Each edge in each source tree gives you one column in the matrix (based on the bipartition it defines on its leafset) Then you analyze the matrix using parsimony or CFN maximum likelihood

Single gene vs. multi-gene analyses Most methods analyze single genes (or other genomic region). These produce estimated gene trees . But species trees are estimated using multiple genes.

Not all genes present in all species gene 1 gene 3 S1 S2 S3 TCTAATGGAA S1 gene 2 GCTAAGGGAA TATTGATACA S3 TCTAAGGGAA TCTAACGGAA TCTTGATACC TAGTGATGCA S4 S4 S4 GGTAACCCTC S7 S8 S5 TCTAATGGAC GCTAAACCTC S7 S8 TAGTGATGCA S6 TATAACGGAA GGTGACCATC CATTCATACC S7 GCTAAACCTC

Two competing approaches gene 1 gene 2 . . . gene k Species . . . Combined Analysis Analyze separately . . . Supertree Method

FN rate of MRP vs. combined analysis Scaffold Density (%)

SuperFine-boosting: improves accuracy of MRP Scaffold Density (%) (Swenson et al., Syst. Biol. 2012)

SuperFine First, construct a supertree with low false positives Consensus Then, refine the tree to reduce false negatives by resolving each polytomy using a base supertree method (e.g., MRP) Quartet Max Cut The Strict

Obtaining a supertree with low FP The Strict Consensus Merger (SCM) SCM of two trees Computes the strict consensus on the common leaf set Then superimposes the two trees, contracting more edges in the presence of collisions

Strict Consensus Merger (SCM) b e a e b a e c b f a c d f g g d f g a b b a c h d i j c h c i j d h d i j

Theoretical results for SCM SCM can be computed in polynomial time For certain types of inputs, the SCM method solves the NP-hard Tree Compatibility problem All splits in the SCM appear in at least one source tree (and are not contradicted by any source tree)

Performance of SCM Low false positive (FP) rate (Estimated supertree has few false edges) High false negative (FN) rate (Estimated supertree is missing many true edges)

Part II of SuperFine Refine the tree to reduce false negatives by resolving each polytomy using a base supertree method (e.g., MRP)

Part 1 of SuperFine b a e b e a e c b f a c d f g g d f g a b b a c h d i j c h c i j d h d i j

Part 2 of SuperFine b 1 e 1 a 1 e b a 1 f g f 65 65 c 1 d 4 4 c g h d i j a 1 b 1 1 2 3 h 1 i j 4 5 6 c 1 a c e b h 2 2 3 d g f d 4 4 i j 3 3

Theorem Given a set of source trees, SCM tree T, and a polytomy in T, after relabelling and reducing, each source tree has at most one leaf with each label.

Step 2: Apply MRP to the collection of reduced source trees 1 4 5 1 4 6 5 MRP 1 4 2 3 6 2 3

Replace polytomy using tree from MRP e b b a ce f 5 g a 4 d 1 g c h d i j 2 h 3 6 f j h i i j a c e b d g f

Resolving a single polytomy, v, using MRP Step 1: Reduce each source tree to a tree on leafset, {1,2,...,d} where d=degree(v) Step 2: Apply MRP to the collection of reduced source trees, to produce a tree t on {1,2,...,d} Step 3: Replace the star tree at v by tree t

SuperFine-boosting: improves accuracy of MRP Scaffold Density (%) (Swenson et al., Syst. Biol. 2012)

SuperFine is also much faster MRP 8-12 sec. SuperFine 2-3 sec. Scaffold Density (%) Scaffold Density (%) Scaffold Density (%)

Uses of Supertree Methods DACTAL: divide-and-conquer trees almost without alignments (Nelesen et al, 2012) DCM1-boosting In these methods, the dataset is divided into subsets (using chordal graph theory), and then trees on the subsets are combined (using supertree methods). Proofs about the algorithm guarantees are established using chordal graph theory.

Chordal graph algorithms yield phylogeny estimation from polynomial length sequences 0.8 NJ DCM1-NJ Theorem (Warnow et al., SODA 2001): DCM1-NJ correct with high probability given sequences of length O(ln n eO(ln n)) Simulation study from Nakhleh et al. ISMB 2001 0.6 Error Rate 0.4 0.2 0 0 400 800 1200 1600 No. Taxa

DACTAL: divide-and-conquer trees without alignment (submitted) BLAST- based Existing Method: RAxML(MAFFT) Unaligned Sequences Overlapping subsets pRecDCM3 A tree for each subset New supertree method: SuperFine A tree for the entire dataset

DACTAL more accurate than all standard methods, and much faster than SAT Average results on 3 large RNA datasets (6K to 28K) CRW: Comparative RNA database, structural alignments 3 datasets with 6,323 to 27,643 sequences Reference trees: 75% RAxML bootstrap trees DACTAL (shown in red) run for 5 iterations starting from FT(Part) SAT -1 fails on the largest dataset SAT -2 runs but is not more accurate than DACTAL, and takes longer

More divide-and-conquer Recall that SATe and PASTA use divide-and- conquer (and also iteration) to improve alignment estimation. Alignments on different subsets are merged together using techniques like OPAL and Muscle. However, alignments can also be merged together using HMM-HMM alignment. You should think of your own algorithmic designs for improving scalability and accuracy, whether for MSA or for tree estimation!

Profile HMMs

Download Presentation

Presentation Transcript

Related

More Related Content