Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform
The presentation discusses the challenges in short read alignment, introduces the Burrows-Wheeler Transform (BWT) algorithm, and compares it with other alignment algorithms. It explores exact and inexact matching approaches, presenting results and conclusions. The focus is on improving alignment speed and memory efficiency while handling mismatches effectively.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Final Presentation Fast and Accurate Short Read Alignment with Burrows-Wheeler Transform Group 1 (1) (2) (3) (4) (5)
Outline 1) Introduction & Background review 2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching 4) Inexact Matching 5) Result & Conclusion 6) Reference
Introduction (1/3) [1] Motivation: Much reads: 50~200 million 32-100 bp reads Reference sequence determined
Introduction (2/3) [2] BLAST/BLAT Suffix array: Requires 12GB for human genome Requires New Alignment Algorithm
Introduction (2/3) [1] Four category of algorithms for this problem Category Representative Pros Cons Hash the read sequence MAQ Flexible memory footprint No multi- threading Hash the genome ReSEQ Easy multi- threading Large memory Merge-sorting sequences Malhis *** Hard for pairing Burrows-Wheeler Transform Bowtie Relative small memory footprint ***
Comparison Feature Speed memory Hash read sequence No multi-threading Memory footprint Hash genome Multi-threading large Merge sorting fast (no pairing) BWT fast Smaller memory footprint Basing BWT, inexact matching algorithm proposed
Outline 1) Introduction & Background review 2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching 4) Inexact Matching 5) Result & Conclusion 6) Reference
Prefix of string GOOGOL G GO GOO GOOG GOOGO GOOGOL
2.1 Prefix trie and string matching dashed line shows the route of the brute-force search for a query string LOL , allowing at most one mismatch Suffix array interval ^ mark start of the string
Testing whether a query W is an exact substring of X can be done in O(|W|) time. To allow mismatches, we can exhaustively traverse the trie. We will show later how to accelerate this search by using prefix information of W.
Suffix of string GOOGOL GOOGOL OOGOL OGOL GOL OL L
Define some variables A string X = a0a1 : : : an-1 is always ended with symbol $. X[i] = ai, X[i; j] =ai .. aj, a substring of X Xi = X[i, n-1], a suffix of X Suffix array S, S(i) is the start position of the i-th smallest suffix. B[i] = $ when S(i) = 0 and B[i] = X[S(i) - 1] otherwise.
In practice, we usually construct the suffix array first and then generate BWT. Most algorithms for constructing suffix array require at least of working space, which amounts to 12GB for human genome. Hon et al. (2007) gave a new algorithm which will only require less than 1GB memory at peak time for constructing the BWT of human genome. This algorithm is implemented in BWT-SW (Lam et al., 2008). We adapted its source code to make it work with BWA (this paper).[3][4] bits log n n 2
2.3 Suffix array interval and sequence alignment = R (W) min{k : W is the prefix of X } S(k) = R (W) max{k : W is the prefix of X } S(k) the set of positions of all occurrences of W in X is ) ( : ) ( { R k W R k S is called the Suffix array interval of W [ R (W), R ( W )] ( )} W
For example the SA interval of string go is [1; 2]. The suffix array values in this interval are 3 and 0 which give the positions of all the occurrences of go . Sequence alignment is equivalent to searching for the SA intervals of substrings of X that match the query. For the exact matching problem, we can find only one such interval. For the inexact matching problem, there may be many.
Outline 1) Introduction & Background review 2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching 4) Inexact Matching 5) Result & Conclusion 6) Reference
Review X = googol$ ?(?) min { k : W is the prefix of XS(k) } max { k : W is the prefix of XS(k) } ?(?) ?(??) = 1 ?(??) = 2
Definition X = googol$ C(a) The number of symbols in X[0,n-2] that are lexicographically smaller than a C(g) = 0 C(l) = 2 C(o) = 3
Definition X = googol$ O(a,i) The number of occurrences of a in B[0,i] 0 , i = 0 1 , i = 1,2 2 , i = 3 3 , 4 <= i <= 6 O(o,i) = 0 , 0 <= i <= 4 1 , i = 5 2 , i = 6 O(g,i) = O(l,i) = 1 , 0 <= I <= 6
Definition X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo g o $ o l o g
Meaning X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo C(o) = 3
Meaning X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo
Meaning X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo
Meaning X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo If ? ?? R(aW) >= 0, then aW is a substring of X
Example X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo C(o) = 3
Example X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo C(o) = 3 O(o, 0) = 0 ? ? = 2 R(W) = 1 ? ??? = C(o) + O(o, 0) + 1 = 3 + 0 + 1 = 4
Example X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo C(o) = 3
Example X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo C(o) = 3 O(o, 2) = 1 ? ? = 2 R(W) = 1 ? ??? = C(o) + O(o, 2) = 3 + 1 = 4
Example X = googol$ ? ?? ? ?? C(a) + O(a, R(W) 1) + 1 C(a) + O(a, ? ? ) W = go aW = ogo ? ?? R(aW) = 4 4 = 0 ogo is a substring of X S(4) = 2
Outline 1) Introduction & Background review 2) Prefix trie and Burrows-Wheeler transform 3) Exact Matching 4) Inexact Matching 5) Result & Conclusion 6) Reference
Between Exact & Inexact Matching Exact Find all exact substrings (get positions) Inexact Find all similar substrings (get positions) Bounded differences (insertion/deletion/mismatch) Reference string: X Bob spent all his money on a game called monkey money money Query string: W
An artificial example Reference string: X TTAACGTTTATTACGTTTAAGTTTAACCTT AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAAGTTTAACCTT AACG Query string: W Allowed differences: 2 To follow the procedures of exact matching, we ll scan W from right to left We have a budget of $2 from the beginning Minus 1 when one difference occurs Stop when bankrupt occurs or W is fully scanned
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACTTG AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACTTG AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACTTG AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAAGTTTAACCTT AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG Allowed differences: 2 Query string: W
Straightforward ideas Reference string: X TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG Allowed differences: 2 Query string: W
Before illustrating Something we knew in Exact- Matching In O(|W|) time, we can find all positions X: googol$ W:go In O(1) time, we find all updated positions X: googol$ W:ogo Magic 2 numbers can show all positions
Algorithm INEXRECUR(W,i,z,k,l) A Recursive function W: query string Handle W[i] in this recursion z: the remaining budgets (k,l) represents the previous interval AACG Query string: W
INEXRECUR(W,i,z,k,l) 0 Fully scanned Return the acceptable interval
INEXRECUR(W,i,z,k,l) 0 TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG AACG I is ready to collect all similar intervals Insertion to X
TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG AACG 0 TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG AACG TTAACGTTTAACTTGTTTAA-GTTTAACCTT AACG AACG deletion from X