RNA-Seq Read Mapping and Quantitation

 
RNA-Seq: a Closer Look at Read
Mapping and Quantitation
 
Jeremy Buhler
for GEP Alumni Workshop
RNA-Seq Pipeline for
Expression Analysis
RNA Source
 
Topics for Today
 
Read mapping 
and RNA 
quantitation
 from
read counts are two key computational steps
in RNA-Seq expression analysis.
 
Mapping
: how do we do it 
fast
?
 
Quantitation
: how do we get from mapped
reads to transcript abundance?
Problem: Short-Read Mapping
 
Given
 a genome or transcriptome
database 
D
 
M
 reads – DNA seqs of some
length 
s
 
<< |D|
 
For each read, does it appear
in D with at most 
k
 
differences
,
and if so, where?
D
M
s
 
Typical Parameters
 
Database 
D
 has size 
10
7
 – 10
10
 bases
 
Number of reads 
M
 is 
10
6
 – 10
8
 
Length 
s
 is 
75-150 
(may vary among reads)
 
# differences 
k
 is 
0-3 
(added, deleted, changed)
to account for sequencing errors, polymorphisms
 
Design Pressures
 
# reads M is huge! 
 
minimal time per read
 
Cannot afford to spend O(|D|) time per read
as in BLAST 
 need to 
index database
 
in
order to spend less time matching each read
 
Index should be small enough to reside in fast
memory, so as to maximize mapping speed
 
Basic Idea: Suffix Tree Index
 
First, sort all suffixes of database string D to
create sorted 
suffix array
 
A.
 
Second, merge all common prefixes of
adjacent strings in A to create a 
suffix tree
 
T.
 
Leaves of tree correspond to suffixes of D.
Construct a Suffix Array (A)
acagaccaga$
0
1
2
3
4
5
6
7
8
9
10
D
 
= 
Suffix
Position
Suffix Tree Example
 
9
T
A
 
9 
 a$
 
 
10
  $
Suffix
 
Rapid Matching vs Suffix Tree
 
Where is 
cag
?
 
Can find all
matches to a read
in D using time
proportional to
read length 
s.
 
Rapid Matching vs Suffix Tree
 
Where is 
cag
?
 
Can find all
matches to a read
in D using time
proportional to
read length 
s.
 
Rapid Matching vs Suffix Tree
 
Where is 
cag
?
 
Can find all
matches to a read
in D using time
proportional to
read length 
s.
Rapid Matching vs Suffix Tree
Where is 
cag
?
Can find all
matches to a read
in D using time
proportional to
read length 
s.
 
Extension to Inexact Matching
 
To permit matches with k 
substitutions
, try
multiple paths, but charge for each mismatch.
(Try searching for “aca” with one mismatch.)
 
To permit matches with k 
differences
, can do
dynamic programming 
or
 heuristic search 
to
compute 
edit distance
 of read against many paths
in tree.
 
Descent stops for a read when we hit bottom of
tree or find that path requires > k differences.
Spliced read mapping (TopHat)
Genome
Reads
Burrows-Wheeler Transform (BWT)
 
0
  acagaccaga$
 
1
  cagaccaga$
 
2
  agaccaga$
 
3
  gaccaga$
 
4
  accaga$
 
5
  ccaga$
 
6
  caga$
 
7
  aga$
 
8
  ga$
 
9 
 a$
 
10
  $
Suffix Array
 
BWT: 
aaaacccg$ga
0
1
2
3
4
5
6
7
8
9
10
 
D
 = 
acagaccaga$
 
acagaccaga
 
D
 = 
acagaccaga
$
 
 
9 
 a$
acagaccag
 
D
 = 
acagaccag
a$
 
“Virtual Tree”: FM-Index
 
Suffix trees need 10-100 bytes memory/base
indexed
 
Theorem
 [Ferragina & Manzini, 2000]: we can
build an 
O(1)-time oracle 
that, given “suffix tree
posn” corresponding to a string 
z
, returns posn
corresponding to one-char extension 
z.c
 
Hence, we can simulate suffix tree matching
without explicitly storing the tree!
 
Space cost: ~1 byte/base
 
Commonly Used Mapping Software
 
Bowtie
 [Langmead 
et al.
 2009, 2012]
 
BWA
 [Li & Durbin 2009, 2010]
 
SOAP2
 [Li 
et al.
 2009]
 
All these tools use variants of
FM-index approach.
 
On to Quantitation…
 
See the “
RNA Quantitation from RNA-Seq Data
PowerPoint presentation and lecture notes
Slide Note

12/25/2022

Embed
Share

Dive into the world of RNA-Seq expression analysis through topics such as read mapping, quantitation, and the challenges faced in short-read mapping. Discover the typical parameters, design pressures, and basic ideas behind suffix tree index construction for efficient handling of large reads and databases in RNA-Seq analysis.

  • RNA-Seq
  • Read Mapping
  • Quantitation
  • Expression Analysis
  • Suffix Tree

Uploaded on Sep 17, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. RNA-Seq: a Closer Look at Read Mapping and Quantitation Jeremy Buhler for GEP Alumni Workshop

  2. RNA-Seq Pipeline for Expression Analysis RNA-Seq Read Count per Transcript RNA Source RNA Reads 37251 20653 9827 5121 RNA Abundance per Transcript

  3. Topics for Today Read mapping and RNA quantitation from read counts are two key computational steps in RNA-Seq expression analysis. Mapping: how do we do it fast? Quantitation: how do we get from mapped reads to transcript abundance?

  4. Problem: Short-Read Mapping D Given a genome or transcriptome database D M reads DNA seqs of some length s << |D| M For each read, does it appear in D with at most k differences, and if so, where? s

  5. Typical Parameters Database D has size 107 1010 bases Number of reads M is 106 108 Length s is 75-150 (may vary among reads) # differences k is 0-3 (added, deleted, changed) to account for sequencing errors, polymorphisms

  6. Design Pressures # reads M is huge! minimal time per read Cannot afford to spend O(|D|) time per read as in BLAST need to index database in order to spend less time matching each read Index should be small enough to reside in fast memory, so as to maximize mapping speed

  7. Basic Idea: Suffix Tree Index First, sort all suffixes of database string D to create sorted suffix array A. Second, merge all common prefixes of adjacent strings in A to create a suffix tree T. Leaves of tree correspond to suffixes of D.

  8. Construct a Suffix Array (A) 0 1 2 3 4 5 6 7 8 9 10 acagaccaga$ D = A Position Suffix Suffix 10 $ 10 $ 9 a$ 9 a$ 8 ga$ 0 acagaccaga$ 4 accaga$ 7 aga$ Sort by suffix 7 aga$ 6 caga$ 2 agaccaga$ 6 caga$ 5 ccaga$ 4 accaga$ 3 gaccaga$ 1 cagaccaga$ 5 ccaga$ 8 ga$ 2 agaccaga$ 1 cagaccaga$ 3 gaccaga$ 0 acagaccaga$

  9. Suffix Tree Example 0 1 2 3 4 5 6 7 8 9 10 D = acagaccaga$ T A Suffix a c g a 10 $ 9 a$ $ g a c c a g a 0 aca 4 acc 7 aga$ c $ c a $ c 2 agac 6 caga$ $ c 1 cagac 5 cca 8 ga$ 3 gac 9 0 4 7 2 6 1 5 8 3

  10. Rapid Matching vs Suffix Tree Where is cag? Can find all matches to a read in D using time proportional to read length s. a c g a $ g a c c a g a c $ c a $ c $ c 9 0 4 7 2 6 1 5 8 3

  11. Rapid Matching vs Suffix Tree Where is cag? Can find all matches to a read in D using time proportional to read length s. a c g a $ g a c c a g a c $ c a $ c $ c 9 0 4 7 2 6 1 5 8 3

  12. Rapid Matching vs Suffix Tree Where is cag? Can find all matches to a read in D using time proportional to read length s. a c g a $ g a c c a g a c $ c a $ c $ c 9 0 4 7 2 6 1 5 8 3

  13. Rapid Matching vs Suffix Tree Where is cag? Can find all matches to a read in D using time proportional to read length s. a c g a $ g a c c a g a c $ c a $ c $ c 0 1 2 3 4 5 6 7 8 9 10 D = acagaccaga$ 9 0 4 7 2 6 1 5 8 3

  14. Extension to Inexact Matching To permit matches with k substitutions, try multiple paths, but charge for each mismatch. (Try searching for aca with one mismatch.) To permit matches with k differences, can do dynamic programming or heuristic search to compute edit distance of read against many paths in tree. Descent stops for a read when we hit bottom of tree or find that path requires > k differences.

  15. Spliced read mapping (TopHat) Reads Genome Map unspliced reads Initially Unmapped (IUM) reads S1 S3 S1 GT AG IUM Junction flanking index S2 part 1 S2 part 2 S1 S2 S2 S3 Split IUM into segments

  16. Burrows-Wheeler Transform (BWT) 0 1 2 3 4 5 6 7 8 9 10 D = acagaccaga$ D = acagaccaga$ D = acagaccaga$ Burrows-Wheeler Matrix acagaccaga Suffix Array 10 $ 9 a$acagaccag 10 $ 9 a$ 0 acagaccaga$ 4 accaga$acag 7 aga$acagacc 0 acagaccaga$ 4 accaga$ 7 aga$ 2 agaccaga$ac 6 caga$acagac 2 agaccaga$ 6 caga$ 1 cagaccaga$a 5 ccaga$acaga 8 ga$acagacca 1 cagaccaga$ 5 ccaga$ 8 ga$ 3 gaccaga$aca 3 gaccaga$ BWT: aaaacccg$ga

  17. Virtual Tree: FM-Index Suffix trees need 10-100 bytes memory/base indexed Theorem [Ferragina & Manzini, 2000]: we can build an O(1)-time oracle that, given suffix tree posn corresponding to a string z, returns posn corresponding to one-char extension z.c Hence, we can simulate suffix tree matching without explicitly storing the tree! Space cost: ~1 byte/base

  18. Commonly Used Mapping Software Bowtie [Langmead et al. 2009, 2012] BWA [Li & Durbin 2009, 2010] SOAP2 [Li et al. 2009] All these tools use variants of FM-index approach.

  19. On to Quantitation See the RNA Quantitation from RNA-Seq Data PowerPoint presentation and lecture notes

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#