Sequence Alignment and Scoring Matrices

 
 
Your friend has a hobby of generating random
bit strings, and finding patterns in them.
One day she come to you, excited and says: I
found the strangest thing in my randomly
generated string; a row of of length 30 which
was as follows
010101010101010101010101010101
If her original string was of length 2B, should
she be surprised?
 
We considered the basics of sequence alignment
Opt score computation
Reconstructing alignments
Local alignments
Affine gap costs
Space saving measures
Scoring matrices for DNA and protein sequences
The local alignment algorithm (for biological strings)
was given by Temple Smith, and Mike Waterman, and is
called the Smith-Waterman algorithm (SW)
Can we recreate Blast?
 
The output of BLAST is a collection of
alignments sorted in decreasing order of
scores.
 
Where do we stop?
 
Consider
a large random DNA database (size n)
a random query DNA string (size m)
A score function: match/mismatch, etc.
Use SW/BLAST to search the database against the
query, and choose all alignments that score above a
threshold ‘s’
All hits are bogus.
Q: how many hits do we get above score s?
Q: what is the probability that we even get one hit
above s?
 
BLAST: The matching regions are expanded into
alignments, which are scored using SW, and an
appropriate scoring matrix.
The results are presented in order of decreasing scores
The score is just a number.
How significant is the top scoring hits if it has a score S?
Expect/E-value (score S)= Number of times we would
expect to see a random match generate a score S, or
better
P-value= Probability if seeing a random hit score S, or
better.
How can we compute E-value, P-value?
 
query
 
26
 
19
 
405
 
422
 
Schematic
 
db
 
Q beg
 
S beg
 
Q end
 
S end
 
S Id
 
Q beg
 
S beg
 
Q end
 
S end
 
S Id
 
Some quantities can be reasonably guessed by
taking a statistical sample, others not
Average weight of a group of 100 people
Average height  of a group of 100 people
Average grade on a test
Give an example of a quantity that cannot.
When the distribution, and the expectation is
known, it is easy to see when you see
something significant.
If the distribution is not well understood, or the
wrong distribution is chosen, a wrong
conclusion can be drawn
 
P-value(11): probability that a specific value
(11) or something more extreme is achieved by
chance.
E-value(11): The number of times we expect to
see a specific value or something more extreme
just by chance.
 
P-value(11): probability that a specific value (11) or
something more extreme is achieved by chance.
E-value(11): The number of times we expect to see a specific
value or something more extreme just by chance.
E.g., look at alignment scores obtained by chance
 1, 2, 8, 3, 5, 3,6,12,4, 4,1,5,3,6,7,15
Compute a Distribution
1-2    XXX
3-4    XXXX
5-6    XXXX
7-8    XX
9-10
11-12  X
12-13
14-15  X
P-value (11) = 2/15
 
Given a collection of numbers (scores)
1, 2, 8, 3, 5, 3,6, 4, 4,1,5,3,6,7,….
Plot its distribution as follows:
X-axis =each number
Y-axis (count/frequency/probability) of
seeing that number
More generally, the x-axis can be a range to
accommodate real numbers
 
Goal: Compute P-value(x)
A simple empirical method:
Compute a distribution of
scores against a random
database.
Use an estimate of the area
under the curve to get the
probability.
OR, fit the distribution to one
of the standard distributions.
 
x
 
P-val(x)
 
Initial assumption was that the scores followed a
normal distribution.
Z-score computation:
For any alignment, score S, shuffle one of the sequences
many times,  and recompute alignment. Get mean and
standard deviation
 
 
 
Look up a table to get a P-value
 
Initial (and natural) assumption was that scores followed a Normal
distribution
1990, Karlin and Altschul showed that ungapped local alignment
scores follow an exponential distribution
Practical consequence:
Longer tail.
Previously significant hits now not so significant
 
For simplicity, assume that the database is a
binary string, and so is the query.
Let match-score=1,
mismatch score=- 
,
indel=-
    (No gaps allowed)
What does it mean to get a score k?
 
Database size n=100M, Querysize m=1000.
 
For a typical query, there are only a few ‘real
hits’ in the database.
Much of the database is random from the
query’s perspective
Consider a random DNA string of length n.
Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25
Assume for the moment that the query is all A’s
(length m).
What is the probability that an exact match to
the query can be found?
 
Probability that there is a match starting at a
fixed position i = 0.25
m
What is the probability that some position i has
a match.
Dependencies confound probability estimates.
Q: Toss a coin many times: If it is HEADS, and
the previous time was HEADS too, you get a
dollar.
What is the money you expect to get after n
tosses?
Let X
i
 be the amount earned in the i-th toss
 
Expected number of matches can still be computed.
 
 
Let X
i
=1 if there is a match starting at position i,
 X
i
=0 otherwise
 
 
Expected number of matches =
 
i
 
Expected number of matches = n*0.25
m
If n=10
7
, m=10,
Then, expected number of matches = 9.537
If n=10
7
, m=11
expected number of hits = 2.38
n=10
7
,m=12,
Expected number of hits = 0.5 < 1
Bottom Line: An exact match to a substring of the
query is unlikely just by chance.
 
Random Database, Pr(1) = p
What is the expected number of hits to a sequence of k 1’s
 
 
 
 
Instead, consider a random binary Matrix. Expected # of diagonals
of k 1s
 
 
 
As you increase k, the number decreases exponentially.
The number of diagonals of k runs can be
approximated by a Poisson process
 
 
 
 
In ungapped alignments, we replace the coin tosses by
column scores, but the behavior does not change
(Karlin & Altschul).
As the score increases, the number of alignments that
achieve the score decreases exponentially
 
Choose a score such that the expected score between a pair of residues
< 0
Expected number of alignments with a score S
 
 
 
 
 
 
 
 
 
 
For small values, E-value and  P-value are the same
 
Bit score
Slide Note
Embed
Share

In this content, we dive into the fundamentals of sequence alignment, Opt score computation, reconstructing alignments, local alignments, affine gap costs, space-saving measures, and scoring matrices for DNA and protein sequences. We explore the Smith-Waterman algorithm (SW) for local sequence alignment and ponder on the possibility of recreating Blast. Additionally, we discuss the output of Blast, the concept of P-value computation, and the significance of top-scoring hits in Blast results. Thought experiments and practical applications of sequence alignment techniques are also explored in this comprehensive overview.

  • Sequence Alignment
  • Scoring Matrices
  • Blast
  • Smith-Waterman Algorithm
  • P-value computation

Uploaded on Sep 07, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. CSE182-L5: Scoring matrices Dictionary Matching

  2. Silly Quiz Your friend has a hobby of generating random bit strings, and finding patterns in them. One day she come to you, excited and says: I found the strangest thing in my randomly generated string; a row of of length 30 which was as follows 010101010101010101010101010101 If her original string was of length 2B, should she be surprised?

  3. Summary We considered the basics of sequence alignment Opt score computation Reconstructing alignments Local alignments Affine gap costs Space saving measures Scoring matrices for DNA and protein sequences The local alignment algorithm (for biological strings) was given by Temple Smith, and Mike Waterman, and is called the Smith-Waterman algorithm (SW) Can we recreate Blast?

  4. Whats next? The output of BLAST is a collection of alignments sorted in decreasing order of scores. Where do we stop?

  5. A thought experiment Consider a large random DNA database (size n) a random query DNA string (size m) A score function: match/mismatch, etc. Use SW/BLAST to search the database against the query, and choose all alignments that score above a threshold s All hits are bogus. Q: how many hits do we get above score s? Q: what is the probability that we even get one hit above s?

  6. P-value computation BLAST: The matching regions are expanded into alignments, which are scored using SW, and an appropriate scoring matrix. The results are presented in order of decreasing scores The score is just a number. How significant is the top scoring hits if it has a score S? Expect/E-value (score S)= Number of times we would expect to see a random match generate a score S, or better P-value= Probability if seeing a random hit score S, or better. How can we compute E-value, P-value?

  7. Blast Output 1 S Id Schematic Q beg 26 422 query db S beg 19 405 Q end S end

  8. Blast Output 2 (drosophila) S Id Q beg S beg Q end S end

  9. Expectation? Some quantities can be reasonably guessed by taking a statistical sample, others not Average weight of a group of 100 people Average height of a group of 100 people Average grade on a test Give an example of a quantity that cannot. When the distribution, and the expectation is known, it is easy to see when you see something significant. If the distribution is not well understood, or the wrong distribution is chosen, a wrong conclusion can be drawn

  10. P-value and E-value P-value(11): probability that a specific value (11) or something more extreme is achieved by chance. E-value(11): The number of times we expect to see a specific value or something more extreme just by chance.

  11. P-value P-value(11): probability that a specific value (11) or something more extreme is achieved by chance. E-value(11): The number of times we expect to see a specific value or something more extreme just by chance. E.g., look at alignment scores obtained by chance 1, 2, 8, 3, 5, 3,6,12,4, 4,1,5,3,6,7,15 Compute a Distribution 1-2 XXX 3-4 XXXX 5-6 XXXX 7-8 XX 9-10 11-12 X 12-13 14-15 X P-value (11) = 2/15

  12. Distribution Given a collection of numbers (scores) 1, 2, 8, 3, 5, 3,6, 4, 4,1,5,3,6,7, . Plot its distribution as follows: X-axis =each number Y-axis (count/frequency/probability) of seeing that number More generally, the x-axis can be a range to accommodate real numbers

  13. P-value computation Goal: Compute P-value(x) A simple empirical method: Compute a distribution of scores against a random database. Use an estimate of the area under the curve to get the probability. OR, fit the distribution to one of the standard distributions. x P-val(x)

  14. Z-scores for alignment Initial assumption was that the scores followed a normal distribution. Z-score computation: For any alignment, score S, shuffle one of the sequences many times, and recompute alignment. Get mean and standard deviation ZS=S -m s -z2 1 2pe f (Z) = 2 Look up a table to get a P-value

  15. Blast E-value Initial (and natural) assumption was that scores followed a Normal distribution 1990, Karlin and Altschul showed that ungapped local alignment scores follow an exponential distribution Practical consequence: Longer tail. Previously significant hits now not so significant

  16. Altschul Karlin statistics For simplicity, assume that the database is a binary string, and so is the query. Let match-score=1, mismatch score=- , indel=- (No gaps allowed) What does it mean to get a score k?

  17. Exact matches Database (n) Database size n=100M, Querysize m=1000. Query (m)

  18. Observations For a typical query, there are only a few real hits in the database. Much of the database is random from the query s perspective Consider a random DNA string of length n. Pr[A]=Pr[C] = Pr[G]=Pr[T]=0.25 Assume for the moment that the query is all A s (length m). What is the probability that an exact match to the query can be found?

  19. Basic probability Probability that there is a match starting at a fixed position i = 0.25m What is the probability that some position i has a match. Dependencies confound probability estimates.

  20. Basic Probability:Expectation Q: Toss a coin many times: If it is HEADS, and the previous time was HEADS too, you get a dollar. What is the money you expect to get after n tosses? Let Xi be the amount earned in the i-th toss E(Xi) =1 p+(0) (1- p) = p i i Xi) = E(Xi) = E( np

  21. Expected number of matches Expected number of matches can still be computed. i Let Xi=1 if there is a match starting at position i, Xi=0 otherwise Pr(Match at Position i) = pi=0.25m E(Xi) = pi=0.25m Expected number of matches = ( ) i i m = n14 Xi)= E( E(Xi)

  22. Expected number of exact Matches is small! Expected number of matches = n*0.25m If n=107, m=10, Then, expected number of matches = 9.537 If n=107, m=11 expected number of hits = 2.38 n=107,m=12, Expected number of hits = 0.5 < 1 Bottom Line: An exact match to a substring of the query is unlikely just by chance.

  23. Exponential distribution Random Database, Pr(1) = p What is the expected number of hits to a sequence of k 1 s 1 p -kln (n-k)pk@ nekln p= ne Instead, consider a random binary Matrix. Expected # of diagonals of k 1s 1 p -kln L =(n-k)(m-k)pk@ nmekln p= nme

  24. As you increase k, the number decreases exponentially. The number of diagonals of k runs can be approximated by a Poisson process Pr[u]=Lue-L u! Pr[u >0]=1-e-L In ungapped alignments, we replace the coin tosses by column scores, but the behavior does not change (Karlin & Altschul). As the score increases, the number of alignments that achieve the score decreases exponentially

  25. Blast E-value Choose a score such that the expected score between a pair of residues < 0 Expected number of alignments with a score S Bit score -lS-ln K ln 2 E =Kmne-lS= mn2 Pr(S x) =1-e-Kmne-lx For small values, E-value and P-value are the same

  26. End of Lecture 5

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#