Gene Prediction: Similarity-Based Approaches in Bioinformatics

undefined
 
D
y
n
a
m
i
c
 
P
r
o
g
r
a
m
m
i
n
g
 
I
I
 
G
e
n
e
 
P
r
e
d
i
c
t
i
o
n
:
 
S
i
m
i
l
a
r
i
t
y
-
B
a
s
e
d
 
A
p
p
r
o
a
c
h
e
s
 
The idea of similarity-based approach to gene prediction
Exon Chaining Problem
Spliced Alignment Problem
 
Gene Prediction
Computational problem of predicting the locations of genes in a genome given only the
genomic DNA sequence (gene is broken into pieces called as exons that are separated by junk
NDA/introns).
Two approaches
1.
Statistical Approaches : 
limited success since it tends to match frequently in the
genome at non-splice sites
 using a profile describing the propensities of different nucleotides to occur at
different positions.
  use stop codons, TAA, TAG, TGA, or start codon ATG
  AG and GT on the left- and right-hand sides of an exon are highly conserved.
2.
Similarity-Based Approaches: 
using previously sequenced genes and their
protein products as a template for the recognition of unknown genes in newly
sequenced DNA fragments.
Find all local similarities between the genomic sequence and the target
protein sequence (using local alignment algorithm)
Shorten or extend the substrings with high similarity such that they start at
AG and end at GT. The resulting set may contain overlapping substrings.
Choose the best subset of non-overlapping substrings as a putative exon
structure.
 
Similarity-Based Approach to Gene Prediction
 
Genes in different organisms are similar
The similarity-based approach uses known
genes in one genome to predict (unknown)
genes in another genome
P
r
o
b
l
e
m
:
 
G
i
v
e
n
 
a
 
k
n
o
w
n
 
g
e
n
e
 
a
n
d
 
a
n
u
n
a
n
n
o
t
a
t
e
d
 
g
e
n
o
m
e
 
s
e
q
u
e
n
c
e
,
 
f
i
n
d
 
a
 
s
e
t
 
o
f
s
u
b
s
t
r
i
n
g
s
 
o
f
 
t
h
e
 
g
e
n
o
m
i
c
 
s
e
q
u
e
n
c
e
 
w
h
o
s
e
c
o
n
c
a
t
e
n
a
t
i
o
n
 
b
e
s
t
 
f
i
t
s
 
t
h
e
 
g
e
n
e
 
Comparing Genes in Two Genomes
 
Small islands of similarity corresponding to
similarities between exons
 
Reverse Translation
 
Given a known protein, find a gene in the
genome which codes for it
One might infer the coding DNA of the given
protein by reversing the translation process
Inexact: amino acids map to > 1 codon
   UUA, UUG, CUU, CUC, CUA, CUG: leucine
This problem is essentially reduced to an
alignment problem
 
Comparing Genomic DNA Against mRNA
 
Using Similarities to Find the Exon Structure
 
The known frog gene is aligned to different locations
in  the human genome
Find the “best” path to reveal the exon structure of
human gene
 
Finding Local Alignments
 
Use local alignments to find all islands of similarity
 
Chaining Local Alignments
 
Find substrings that match a given gene sequence
(
candidate exons
)
Define a candidate exons as
                      (
l, r, w
)
   (
left, right, weight
 defined as score of local alignment)
Look for a maximum 
chain
 of substrings
Chain: a set of non-overlapping nonadjacent
intervals.
Exon Chaining Problem
Locate the beginning and end of each interval
(
2n
 points)
Find the “best” path
 
Exon Chaining ProblemP
Given a set of putative exons, where each exon is represented by (
l,r,w
), 
l 
and
 r 
are
the left- and right-hand positions, and 
w
 is the weight reflecting the likelihood that
this interval is an exon (e.g., the local alignment score), find a maximum set of non-
overlapping putative exons.
Input: 
A set of weighted intervals (putative exons)
Output: 
A maximum chain of intervals from this set.
 
ExonChaining(G,n)
  
for i := 1 to 2n
        s[i] :=0; 
l
[i] = 0;
  for i : = 1 to 2n
        if vertex v[i] in G corresponds to
            the right end of an interval I
                 j := index of vertex for left end of the interval I
                w := weight of the interval I
                s[i]:= max{s[j] + w, s[i-1]}
                if (s[i] = s[j]+w)  then 
l
[i]= j;
        else s[i] := s[i-1]
 
v
PrintChain(
l, 
m)
If m = 0 return
Else
if ( 
l
[m] ≠ 0 )
     PrintChain(
l, l
[m])
     Print “(“ 
l
[m], m “)”
else PrintChain (l, m-1)
 
    1     2      3      4      5      6      7     8     9   10   11   12    13   14    15    16    17   18
 
Exon Chaining: Deficiencies
 
 
Poor definition of the putative exon endpoints
O
p
t
i
m
a
l
 
c
h
a
i
n
 
o
f
 
i
n
t
e
r
v
a
l
s
 
m
a
y
 
n
o
t
 
c
o
r
r
e
s
p
o
n
d
 
t
o
 
a
n
y
 
v
a
l
i
d
a
l
i
g
n
m
e
n
t
First interval may correspond to a suffix, whereas second
interval may correspond to a prefix
Combination of such intervals is not a valid alignment
 
Infeasible Chains
 
   Red local similarities form two non -overlapping
intervals but do not form a valid global alignment
 
Gene Prediction Analogy: Selecting Putative Exons
 
The cell carries DNA as a blueprint for producing proteins,
 like a manufacturer carries a blueprint for producing a car.
 
Using Blueprint
 
Assembling Putative Exons
 
Still Assembling Putative Exons
 
Spliced Alignment
 
M
i
k
h
a
i
l
 
G
e
l
f
a
n
d
 
a
n
d
 
c
o
l
l
e
a
g
u
e
s
 
p
r
o
p
o
s
e
d
 
a
 
s
p
l
i
c
e
d
a
l
i
g
n
m
e
n
t
 
a
p
p
r
o
a
c
h
 
o
f
 
u
s
i
n
g
 
a
 
p
r
o
t
e
i
n
 
w
i
t
h
i
n
 
o
n
e
g
e
n
o
m
e
 
t
o
 
r
e
c
o
n
s
t
r
u
c
t
 
t
h
e
 
e
x
o
n
-
i
n
t
r
o
n
 
s
t
r
u
c
t
u
r
e
 
o
f
 
a
(
r
e
l
a
t
e
d
)
 
g
e
n
e
 
i
n
 
a
n
o
t
h
e
r
 
g
e
n
o
m
e
.
Begins by selecting either all putative exons between
potential acceptor and donor sites or by finding all
substrings similar to the target protein (as in the Exon
Chaining Problem).
This set is further filtered in a such a way that attempt
to retain all true exons, with some false ones.
 
Spliced Alignment Problem: Formulation
 
G
o
a
l
:
 
F
i
n
d
 
a
 
c
h
a
i
n
 
o
f
 
b
l
o
c
k
s
 
i
n
 
a
 
g
e
n
o
m
i
c
s
e
q
u
e
n
c
e
 
t
h
a
t
 
b
e
s
t
 
f
i
t
s
 
a
 
t
a
r
g
e
t
 
s
e
q
u
e
n
c
e
I
n
p
u
t
:
 
G
e
n
o
m
i
c
 
s
e
q
u
e
n
c
e
s
 
G
,
 
t
a
r
g
e
t
s
e
q
u
e
n
c
e
 
T
,
 
a
n
d
 
a
 
s
e
t
 
o
f
 
c
a
n
d
i
d
a
t
e
 
e
x
o
n
s
 
B
.
O
u
t
p
u
t
:
 
A
 
c
h
a
i
n
 
o
f
 
e
x
o
n
s
 
Γ
 
s
u
c
h
 
t
h
a
t
 
t
h
e
g
l
o
b
a
l
 
a
l
i
g
n
m
e
n
t
 
s
c
o
r
e
 
b
e
t
w
e
e
n
 
Γ
*
 
a
n
d
 
 
T
 
i
s
m
a
x
i
m
u
m
 
a
m
o
n
g
 
a
l
l
 
c
h
a
i
n
s
 
o
f
 
b
l
o
c
k
s
 
f
r
o
m
 
B
.
    
Γ
* - concatenation of all exons from chain 
Γ
 
Lewis Carroll Example
1
2
3
4
1
2
4
3
 
Spliced Alignment: Idea
 
Compute the best alignment between
 i
-prefix of genomic
sequence 
G
 and
 j-
prefix of target 
T:
                              S(i,j)
 
But what is
 “i-
prefix
of
 G?
There may be a few
 i-
prefixes
 
of
 G 
depending on which
block
 B 
we are in.
C
o
m
p
u
t
e
 
t
h
e
 
b
e
s
t
 
a
l
i
g
n
m
e
n
t
 
b
e
t
w
e
e
n
 
i
-
p
r
e
f
i
x
 
o
f
 
g
e
n
o
m
i
c
s
e
q
u
e
n
c
e
 
G
 
a
n
d
 
j
-
p
r
e
f
i
x
 
o
f
 
t
a
r
g
e
t
 
T
 
u
n
d
e
r
 
t
h
e
 
a
s
s
u
m
p
t
i
o
n
t
h
a
t
 
t
h
e
 
a
l
i
g
n
m
e
n
t
 
u
s
e
s
 
t
h
e
 
b
l
o
c
k
 
B
 
a
t
 
p
o
s
i
t
i
o
n
 
i
                                 
S(i,j,B)
 
 
 
Spliced Alignment Recurrence
 
 
 
 
 
 
I
f
 
i
 
i
s
 
n
o
t
 
t
h
e
 
s
t
a
r
t
i
n
g
 
v
e
r
t
e
x
 
o
f
 
b
l
o
c
k
 
B
:
S(i, j, B)
 =
  
max {    
S(i – 1, j, B) – indel penalty
   
 
S(i, j – 1, B) – indel penalty
   
 
S(i – 1, j – 1, B) + 
δ
(g
i
, t
j
)
 }
I
f
 
i
 
i
s
 
t
h
e
 
s
t
a
r
t
i
n
g
 
v
e
r
t
e
x
 
o
f
 
b
l
o
c
k
 
B
:
S(i, j, B)
 =
 
  max { 
S(i, j – 1, B) – indel penalty
max
all blocks 
B’
 preceding block 
B 
S(end(B’), j, B’) – indel penalty
max
all blocks
 B
’ preceding block 
B 
S(end(B’), j – 1, B’) + 
δ
(g
i
, t
j
)
}
 
 
 
S(i,j,B)
 
Spliced Alignment Solution
 
After computing the three-dimensional table
S(i, j, B)
, the score of the optimal spliced
alignment is:
 
max
all blocks
 
B
S(end(B), length(T), B)
 
Spliced Alignment: Complications
 
Considering multiple 
i
-prefixes leads to slow down.
running time:
                              O(mn
2
 
|
B|)
    
where
 m 
is the target length,
 n 
is the genomic
sequence length and
 
|
B
|
 
is the number of blocks.
 
A
 
m
o
s
a
i
c
 
e
f
f
e
c
t
:
 
s
h
o
r
t
 
e
x
o
n
s
 
a
r
e
 
e
a
s
i
l
y
 
c
o
m
b
i
n
e
d
t
o
 
f
i
t
 
a
n
y
 
t
a
r
g
e
t
 
p
r
o
t
e
i
n
Spliced Alignment: Speedup 
Spliced Alignment: Speedup 
 
Spliced Alignment: Speedup
 
     P(i,j)=
max
all blocks 
B
 preceding position
 i 
S(end(B), j, B)
 
Exon Chaining vs Spliced Alignment
 
In Spliced Alignment, every path spells out
string obtained by concatenation of labels of
its edges. The weight of the path is defined as
optimal alignment score between
concatenated labels (blocks) and target
sequence
Defines weight of entire path in graph, but
not the weights for individual edges.
Exon Chaining assumes the positions and weights
of exons are pre-defined
 
Gene Prediction Tools
 
GENSCAN/Genome Scan
TwinScan
Glimmer
GenMark
 
Gene Prediction: Aligning Genome vs. Genome
 
Align entire human and mouse genomes
 
Predict genes in both sequences
simultaneously as chains of aligned blocks
(exons)
 
This approach does not assume any
annotation of either human or mouse genes.
 
 
The GENSCAN Algorithm
 
 
Algorithm is based on probabilistic model of gene structure
similar to 
Hidden Markov Models (HMMs).
GENSCAN uses a training set in order to estimate the
HMM parameters
, then the algorithm returns the exon
structure using maximum likelihood approach standard
to many HMM algorithms (
Viterbi
 algorithm).
Biological input: Codon bias in coding regions, gene
structure (start and stop codons, typical exon and
intron length, presence of promoters, presence of
genes on both strands, etc)
Covers cases where input sequence contains no
gene, partial gene, complete gene, multiple genes.
 
 
GENSCAN Limitations
 
Does not use similarity search to predict
genes.
Does not address alternative splicing.
Could combine two exons from
consecutive genes together
 
Incorporates similarity information into
GENSCAN: predicts gene structure which
corresponds to maximum probability conditional
on similarity information
Algorithm is a combination of two sources of information
Probabilistic models of exons-introns
Sequence similarity information
 
 
GenomeScan
 
TwinScan
 
Aligns two sequences and marks each base
as gap ( - ), mismatch (:), match (|), resulting
in a new alphabet of 12 letters: 
Σ
 {A-, A:, A |,
C-, C:, C |, G-, G:, G |, T-, T:, T|}.
Run Viterbi algorithm using emissions 
e
k
(b)
where 
b
 
 {A-, A:, A|, …, T|}.
 
http://www.standford.edu/class/cs262/
Spring2003/Notes/ln10.pdf
 
TwinScan
 (cont’d)
 
The emission probabilities are estimated from
from human/mouse gene pairs.
Ex. 
e
I
(x|) < e
E
(x|)
 since matches are
favored in exons, and 
e
I
(x-) > e
E
(x-)
 since
gaps (as well as mismatches) are favored
in introns.
Compensates for dominant occurrence of
poly-A region in introns
 
Glimmer
 
G
e
n
e
 
L
o
c
a
t
o
r
 
a
n
d
 
I
n
t
e
r
p
o
l
a
t
e
d
 
M
a
r
k
o
v
 
M
o
d
e
l
E
R
Finds genes in bacterial DNA
Uses interpolated Markov Models
 
The Glimmer Algorithm
 
Made of 2 programs
B
u
i
l
d
I
M
M
Takes sequences as input and outputs the
Interpolated Markov Models (IMMs)
G
l
i
m
m
e
r
Takes IMMs and outputs all candidate genes
Automatically resolves overlapping genes by
choosing one, hence limited
Marks “suspected to truly overlap” genes for
closer inspection by user
 
GenMark
 
 
Based on 
non-stationary
 Markov chain models
 
Results displayed graphically with coding vs.
noncoding probability dependent on position in
nucleotide sequence
 
Homework Assignment
(1)
Solve Exon Chaining Problem in slide 11 when the weight for interval
(6,12) change to 18 and weight for interval change to (13,14) = 4
 
(2) Implement Exon Chaining Algorithm in slide 12.
Print out the tables s, l and exon chain for the problem in slide 11.
Slide Note
Embed
Share

Gene prediction in bioinformatics involves predicting gene locations in a genome using different approaches like statistical methods and similarity-based approaches. The similarity-based approach uses known genes as a template to predict unknown genes in newly sequenced DNA fragments. This method involves finding local similarities between genomic and target protein sequences and selecting non-overlapping substrings with high similarity as putative exon structures. Other techniques discussed include the Exon Chaining Problem and Reverse Translation for inferring coding DNA from known proteins.


Uploaded on Jul 17, 2024 | 2 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Dynamic Programming II Dynamic Programming II Gene Prediction: Similarity Gene Prediction: Similarity- -Based Approaches The idea of similarity-based approach to gene prediction Exon Chaining Problem Spliced Alignment Problem Based Approaches

  2. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Gene Prediction Computational problem of predicting the locations of genes in a genome given only the genomic DNA sequence (gene is broken into pieces called as exons that are separated by junk NDA/introns). Two approaches 1. Statistical Approaches : limited success since it tends to match frequently in the genome at non-splice sites using a profile describing the propensities of different nucleotides to occur at different positions. use stop codons, TAA, TAG, TGA, or start codon ATG AG and GT on the left- and right-hand sides of an exon are highly conserved. 2. Similarity-Based Approaches: using previously sequenced genes and their protein products as a template for the recognition of unknown genes in newly sequenced DNA fragments. Find all local similarities between the genomic sequence and the target protein sequence (using local alignment algorithm) Shorten or extend the substrings with high similarity such that they start at AG and end at GT. The resulting set may contain overlapping substrings. Choose the best subset of non-overlapping substrings as a putative exon structure.

  3. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Similarity-Based Approach to Gene Prediction Genes in different organisms are similar The similarity-based approach uses known genes in one genome to predict (unknown) genes in another genome Problem: Given a known gene and an unannotated genome sequence, find a set of substrings of the genomic sequence whose concatenation best fits the gene

  4. An Introduction to Bioinformatics Algorithms Comparing Genes in Two Genomes www.bioalgorithms.info Small islands of similarity corresponding to similarities between exons

  5. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Reverse Translation Given a known protein, find a gene in the genome which codes for it One might infer the coding DNA of the given protein by reversing the translation process Inexact: amino acids map to > 1 codon UUA, UUG, CUU, CUC, CUA, CUG: leucine This problem is essentially reduced to an alignment problem

  6. An Introduction to Bioinformatics Algorithms Comparing Genomic DNA Against mRNA www.bioalgorithms.info (codon sequence) mRNA { { { { { exon1 intron1 exon2 intron2 exon3 Portion of genome

  7. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Using Similarities to Find the Exon Structure The known frog gene is aligned to different locations in the human genome Find the best path to reveal the exon structure of human gene Frog Gene (known) Human Genome

  8. An Introduction to Bioinformatics Algorithms Finding Local Alignments www.bioalgorithms.info Use local alignments to find all islands of similarity Frog Genes (known) Human Genome

  9. An Introduction to Bioinformatics Algorithms Chaining Local Alignments www.bioalgorithms.info Find substrings that match a given gene sequence (candidate exons) Define a candidate exons as (l, r, w) (left, right, weight defined as score of local alignment) Look for a maximum chain of substrings Chain: a set of non-overlapping nonadjacent intervals.

  10. An Introduction to Bioinformatics Algorithms Exon Chaining Problem www.bioalgorithms.info 5 5 15 9 11 4 3 0 2 3 5 6 11 13 16 20 25 27 28 30 32 Locate the beginning and end of each interval (2n points) Find the best path

  11. An Introduction to Bioinformatics Algorithms Exon Chaining ProblemP Given a set of putative exons, where each exon is represented by (l,r,w), l and r are the left- and right-hand positions, and w is the weight reflecting the likelihood that this interval is an exon (e.g., the local alignment score), find a maximum set of non- overlapping putative exons. Input: A set of weighted intervals (putative exons) Output: A maximum chain of intervals from this set. 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 www.bioalgorithms.info 3 1 0 4 5 12 6 7 10 5 10 12 3 6 4 7 1 0 0 3 3 5 5 5 9 9 10 10 15 15 15 17 17 17 21

  12. An Introduction to Bioinformatics Algorithms ExonChaining(G,n) for i := 1 to 2n s[i] :=0; l[i] = 0; for i : = 1 to 2n if vertex v[i] in G corresponds to the right end of an interval I j := index of vertex for left end of the interval I w := weight of the interval I s[i]:= max{s[j] + w, s[i-1]} if (s[i] = s[j]+w) then l[i]= j; else s[i] := s[i-1] v www.bioalgorithms.info PrintChain(l, m) If m = 0 return Else if ( l[m] 0 ) PrintChain(l, l[m]) Print ( l[m], m ) else PrintChain (l, m-1) 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 3 1 0 4 5 13 I 6 7 10 5 10 13 3 6 4 7 1 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 0 0 3 3 5 5 5 9 9 10 10 15 15 15 17 17 18 21 s l 0 0 2 0 1 0 0 4 0 9 0 5 0 0 11 0 7 16

  13. An Introduction to Bioinformatics Algorithms Exon Chaining: Deficiencies www.bioalgorithms.info Poor definition of the putative exon endpoints Optimal chain of intervals may not correspond to any valid alignment First interval may correspond to a suffix, whereas second interval may correspond to a prefix Combination of such intervals is not a valid alignment

  14. An Introduction to Bioinformatics Algorithms Infeasible Chains www.bioalgorithms.info Red local similarities form two non -overlapping intervals but do not form a valid global alignment Frog Genes (known) Human Genome

  15. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Gene Prediction Analogy: Selecting Putative Exons The cell carries DNA as a blueprint for producing proteins, like a manufacturer carries a blueprint for producing a car.

  16. An Introduction to Bioinformatics Algorithms Using Blueprint www.bioalgorithms.info

  17. An Introduction to Bioinformatics Algorithms Assembling Putative Exons www.bioalgorithms.info

  18. An Introduction to Bioinformatics Algorithms Still Assembling Putative Exons www.bioalgorithms.info

  19. An Introduction to Bioinformatics Algorithms Spliced Alignment www.bioalgorithms.info Mikhail Gelfand and colleagues proposed a spliced alignment approach of using a protein within one genome to reconstruct the exon-intron structure of a (related) gene in another genome. Begins by selecting either all putative exons between potential acceptor and donor sites or by finding all substrings similar to the target protein (as in the Exon Chaining Problem). This set is further filtered in a such a way that attempt to retain all true exons, with some false ones.

  20. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Spliced Alignment Problem: Formulation Goal: Find a chain of blocks in a genomic sequence that best fits a target sequence Input: Genomic sequences G, target sequence T, and a set of candidate exons B. Output: A chain of exons such that the global alignment score between * and T is maximum among all chains of blocks from B. * - concatenation of all exons from chain

  21. An Introduction to Bioinformatics Algorithms Lewis Carroll Example www.bioalgorithms.info 1 2 3 4 1 2 4 3

  22. An Introduction to Bioinformatics Algorithms Spliced Alignment: Idea www.bioalgorithms.info Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T: S(i,j) But what is i-prefix of G? There may be a few i-prefixesof G depending on which block B we are in. Compute the best alignment between i-prefix of genomic sequence G and j-prefix of target T under the assumption that the alignment uses the block B at position i S(i,j,B)

  23. An Introduction to Bioinformatics Algorithms Spliced Alignment Recurrence If i is not the starting vertex of block B: S(i, j, B) = max { S(i 1, j, B) indel penalty S(i, j 1, B) indel penalty S(i 1, j 1, B) + (gi, tj) } If i is the starting vertex of block B: S(i, j, B) = max { S(i, j 1, B) indel penalty maxall blocks B preceding block B S(end(B ), j, B ) indel penalty maxall blocks B preceding block B S(end(B ), j 1, B ) + (gi, tj) } T www.bioalgorithms.info j j-1 G End(B ) i-1 i S(i,j,B)

  24. An Introduction to Bioinformatics Algorithms Spliced Alignment Solution www.bioalgorithms.info After computing the three-dimensional table S(i, j, B), the score of the optimal spliced alignment is: maxall blocksBS(end(B), length(T), B)

  25. An Introduction to Bioinformatics Algorithms Spliced Alignment: Complications www.bioalgorithms.info Considering multiple i-prefixes leads to slow down. running time: O(mn2|B|) where m is the target length, n is the genomic sequence length and|B|is the number of blocks. A mosaic effect: short exons are easily combined to fit any target protein

  26. An Introduction to Bioinformatics Algorithms Spliced Alignment: Speedup www.bioalgorithms.info

  27. An Introduction to Bioinformatics Algorithms Spliced Alignment: Speedup www.bioalgorithms.info

  28. An Introduction to Bioinformatics Algorithms Spliced Alignment: Speedup www.bioalgorithms.info P(i,j)=maxall blocks B preceding position i S(end(B), j, B)

  29. An Introduction to Bioinformatics Algorithms Exon Chaining vs Spliced Alignment www.bioalgorithms.info In Spliced Alignment, every path spells out string obtained by concatenation of labels of its edges. The weight of the path is defined as optimal alignment score between concatenated labels (blocks) and target sequence Defines weight of entire path in graph, but not the weights for individual edges. Exon Chaining assumes the positions and weights of exons are pre-defined

  30. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Gene Prediction Tools GENSCAN/Genome Scan TwinScan Glimmer GenMark

  31. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Gene Prediction: Aligning Genome vs. Genome Align entire human and mouse genomes Predict genes in both sequences simultaneously as chains of aligned blocks (exons) This approach does not assume any annotation of either human or mouse genes.

  32. An Introduction to Bioinformatics Algorithms The GENSCAN Algorithm www.bioalgorithms.info Algorithm is based on probabilistic model of gene structure similar to Hidden Markov Models (HMMs). GENSCAN uses a training set in order to estimate the HMM parameters, then the algorithm returns the exon structure using maximum likelihood approach standard to many HMM algorithms (Viterbi algorithm). Biological input: Codon bias in coding regions, gene structure (start and stop codons, typical exon and intron length, presence of promoters, presence of genes on both strands, etc) Covers cases where input sequence contains no gene, partial gene, complete gene, multiple genes.

  33. An Introduction to Bioinformatics Algorithms GENSCAN Limitations Does not use similarity search to predict genes. Does not address alternative splicing. Could combine two exons from consecutive genes together www.bioalgorithms.info

  34. An Introduction to Bioinformatics Algorithms GenomeScan www.bioalgorithms.info Incorporates similarity information into GENSCAN: predicts gene structure which corresponds to maximum probability conditional on similarity information Algorithm is a combination of two sources of information Probabilistic models of exons-introns Sequence similarity information

  35. An Introduction to Bioinformatics Algorithms TwinScan www.bioalgorithms.info Aligns two sequences and marks each base as gap ( - ), mismatch (:), match (|), resulting in a new alphabet of 12 letters: {A-, A:, A |, C-, C:, C |, G-, G:, G |, T-, T:, T|}. Run Viterbi algorithm using emissions ek(b) where b {A-, A:, A|, , T|}. http://www.standford.edu/class/cs262/ Spring2003/Notes/ln10.pdf

  36. An Introduction to Bioinformatics Algorithms TwinScan(cont d) www.bioalgorithms.info The emission probabilities are estimated from from human/mouse gene pairs. Ex. eI(x|) < eE(x|) since matches are favored in exons, and eI(x-) > eE(x-) since gaps (as well as mismatches) are favored in introns. Compensates for dominant occurrence of poly-A region in introns

  37. An Introduction to Bioinformatics Algorithms Glimmer www.bioalgorithms.info Gene Locator and Interpolated Markov ModelER Finds genes in bacterial DNA Uses interpolated Markov Models

  38. An Introduction to Bioinformatics Algorithms The Glimmer Algorithm www.bioalgorithms.info Made of 2 programs BuildIMM Takes sequences as input and outputs the Interpolated Markov Models (IMMs) Glimmer Takes IMMs and outputs all candidate genes Automatically resolves overlapping genes by choosing one, hence limited Marks suspected to truly overlap genes for closer inspection by user

  39. An Introduction to Bioinformatics Algorithms GenMark www.bioalgorithms.info Based on non-stationary Markov chain models Results displayed graphically with coding vs. noncoding probability dependent on position in nucleotide sequence

  40. An Introduction to Bioinformatics Algorithms www.bioalgorithms.info Homework Assignment (1) Solve Exon Chaining Problem in slide 11 when the weight for interval (6,12) change to 18 and weight for interval change to (13,14) = 4 (2) Implement Exon Chaining Algorithm in slide 12. Print out the tables s, l and exon chain for the problem in slide 11.

Related


More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#