Insights into Genome Assembly and Shotgun Sequencing

Assembling Genomes
BCH394P/364C Systems Biology / Bioinformatics
Edward Marcotte, Univ of Texas at Austin
Nature 409, 860-921(2001)
www.yourgenome.org/facts/timeline-the-human-genome-project
Beijing Genomics Institute
(Wikipedia)
“If it tastes good you should sequence it...
you should know what's in the genes of that species”
Wang Jun, Chief executive, BGI
The NovaSeq in the UT GSAF core generates
>1.4 terabases of sequence in a 1-day run
Illumina
  Many millions of 
75-150 bp reads
 
  Thousands of 1,000 to 1,000,000 (ish) 
bp reads
(Translating the cloning jargon)
Twelve quick steps for genome assembly and annotation in the classroom
PLoS Comp Biology (2020),  doi:10.1371/journal.pcbi.1008325 
Contemporary genome assembly is fairly complex, but at its core
are assembly algorithms that grew from the shotgun concept
Beverly Micro “Pure White Hell” Jigsaw Puzzle (10,000,000,000 Piece)
Thinking about the basic shotgun concept
Start with a very large set of random
sequencing reads
How might we match up the
overlapping sequences?
How can we assemble the overlapping
reads together in order to derive the
genome?
Thinking about the basic shotgun concept
At a high level, the first genomes were
sequenced by comparing pairs of reads
to find overlapping reads
Then, building a graph (
i.e.
, a network)
to represent those relationships
The genome sequence is a “walk”
across that graph
The “Overlap-Layout-Consensus” method
Overlap
:  
 
Compare all pairs of reads
    
(allow some low level of mismatches)
Layout
:  
 
Construct a graph describing the overlaps
    
Simplify the graph
    
Find the simplest path through the graph
Consensus
:  Reconcile errors among reads along that
     
 path to find the consensus sequence
read
read
sequence
overlap
Building an overlap graph
EUGENE W. MYERS. 
Journal of Computational
Biology
. Summer 1995, 2(2): 275-290
5’
3’
Reads
Overlap graph
EUGENE W. MYERS. 
Journal of Computational
Biology
. Summer 1995, 2(2): 275-290 (more or less)
Building an overlap graph
5’
3’
EUGENE W. MYERS. 
Journal of Computational
Biology
. Summer 1995, 2(2): 275-290 (more or less)
1. Remove all contained nodes & edges going to them
Simplifying an overlap graph
EUGENE W. MYERS. 
Journal of Computational
Biology
. Summer 1995, 2(2): 275-290 (more or less)
Simplifying an overlap graph
2. Transitive edge removal:
 
Given A – B – D    and    A – D , remove A – D 
EUGENE W. MYERS. 
Journal of Computational
Biology
. Summer 1995, 2(2): 275-290 (more or less)
Simplifying an overlap graph
3. 
 
If un-branched, calculate consensus sequence
 
If branched, assemble un-branched bits and then decide
  
how they fit together
EUGENE W. MYERS. 
Journal of Computational
Biology
. Summer 1995, 2(2): 275-290 (more or less)
Simplifying an overlap graph
“contig”  (assembled contiguous sequence)
This basic strategy was used for most of
the early genomes.
Also useful: “mate pairs”
2 reads separated by a known distance
Contig #1
Contig #2
Contigs can be ordered using these paired reads
to produce “scaffolds”
Read #1
Read #2
DNA fragment of known size
GigAssembler (used to assemble the public
human genome project sequence)
Jim Kent
David Haussler
Let’s take a little walk through history to see what they did…
Whole genome Assembly: big picture
http://www.nature.com/scitable/content/anatomy-of-whole-genome-assembly-20429
GigAssembler – Preprocessing
1.
 Decontaminating & Repeat Masking.
2.
 Aligning of mRNAs, ESTs, BAC ends & paired
reads against initial sequence contigs.
psLayout → BLAT
3.
 Creating an input directory (folder) structure.
RepBase + RepeatMasker
GigAssembler: Build merged
sequence contigs (“rafts”)
Sequencing quality (Phred Score)
Sequencing quality (Phred Score)
http://en.wikipedia.org/wiki/Phred_quality_score
Base-calling
Error 
Probability
We’re going to skip the remaining
details of GigAssembler (mainly of
historical interest now) to get to the
key strategy for assembling all of
the various contigs and paired end
reads into a genome
GigAssembler:
Build a “raft-ordering” graph
GigAssembler:
Build a “raft-ordering” graph
Add information from mRNAs,
ESTs, paired plasmid reads,
BAC end pairs: building a
“bridge”
Different weight to different data
type: (mRNA ~ highest)
Conflicts with the graph as
constructed so far are rejected.
Build a sequence path through each
raft.
Fill the gap with N’s.
100: between rafts
50,000: between bridged barges
Finding the shortest path across the
ordering graph using the
Bellman-Ford algorithm
http://compprog.wordpress.com/2007/11/29/one-source-shortest-path-the-bellman-ford-algorithm/
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
Find the shortest path to all nodes.
-
4
+
7
Take every edge and try to relax it (N – 1 times where N is the count of nodes)
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
Find the shortest path to all nodes.
-
4
+
7
Take every edge and try to relax it (N – 1 times where N is the count of nodes)
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
Find the shortest path to all nodes.
-
4
+
7
START
Inf.
Inf.
Inf.
Inf.
Take every edge and try to relax it (N – 1 times where N is the count of nodes)
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
Find the shortest path to all nodes.
-
4
+
7
0
S
T
A
R
T
Inf.
Inf.
+
7
(
 
A
)
+
6
(
 
A
)
Take every edge and try to relax it (N – 1 times where N is the count of nodes)
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
Find the shortest path to all nodes.
-
4
+
7
0
S
T
A
R
T
+
7
(
 
A
)
+
6
(
 
A
)
+
2
(
 
B
)
+
4
(
 
D
)
Take every edge and try to relax it (N – 1 times where N is the count of nodes)
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
Find the shortest path to all nodes.
-
4
+
7
0
S
T
A
R
T
+
7
(
 
A
)
+
2
(
 
B
)
+
4
(
 
D
)
+
2
(
 
C
)
Take every edge and try to relax it (N – 1 times where N is the count of nodes)
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
Find the shortest path to all nodes.
-
4
+
7
0
S
T
A
R
T
+
7
(
 
A
)
+
4
(
 
D
)
+
2
(
 
C
)
-
2
(
 
B
)
Take every edge and try to relax it (N – 1 times where N is the count of nodes)
A
B
C
D
E
+
6
+
8
+
7
+
9
+
2
+
5
-
2
-
3
-
4
+
7
0
S
T
A
R
T
+
7
(
 
A
)
+
4
(
 
D
)
+
2
(
 
C
)
-
2
(
 
B
)
Answer: A-D-C-B-E
Modern assemblers now work a bit differently,
using so-called 
DeBruijn graphs:
Here’s what we saw before:
Nature Biotech 
29(11):987-991 (2011)
In 
Overlap-Layout-Consensus
:
Nodes are reads
Edges are overlaps
In a DeBruijn graph:
Nature Biotech 
29(11):987-991 (2011)
Modern assemblers now work a bit differently,
using so-called 
DeBruijn graphs:
Why Eulerian?
  
From Leonhard Euler’s solution in 1735 to the
   
‘Bridges of Königsberg’ problem:
Königsberg (now Kaliningrad, Russia) had 7 bridges connecting 4
parts of the city.  
Could you visit each part of the city, walking
across each bridge only once, & finish back where you started?
Euler conceptualized it as a graph:
 
Nodes = parts of city
 
Edges = bridges
Nature Biotech 
29(11):987-991 (2011)
(Visiting every edge once = an 
Eulerian
 path)
DeBruijn graph 
assemblers tend to have nice
properties, e.g. correcting sequencing errors &
handling repeats better
S
e
q
u
e
n
c
i
n
g
 
e
r
r
o
r
s
 
a
p
p
e
a
r
 
a
s
b
u
l
g
e
s
R
e
m
o
v
i
n
g
 
t
h
e
 
b
u
l
g
e
s
c
o
r
r
e
c
t
s
 
t
h
e
 
e
r
r
o
r
s
(
e
.
g
.
 
l
e
a
v
e
s
 
t
h
e
 
r
e
d
 
p
a
t
h
)
Nature Biotech 
29(11):987-991 (2011)
Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data
Microb Informatics Exp 
(2013) doi:10.1186/2042-5783-3-2
e.g. Velvet, an example algorithm using DeBruijn graphs
Once a reference genome is assembled,
new sequencing data can ‘simply’ be
mapped to the reference.
reads
Reference genome
Mapping reads to assembled
genomes
Trapnell C, Salzberg SL, 
Nat. Biotech.
, 2009
The list is a little longer now!  e.g. see    https://en.wikipedia.org/wiki/
List_of_sequence_alignment_software#Short-Read_Sequence_Alignment
Mapping
strategies
Trapnell C, Salzberg SL, 
Nat. Biotech.
, 2009
Burroughs
Wheeler
indexing
Trapnell C, Salzberg SL, 
Nat. Biotech.
, 2009
Burroughs-Wheeler transform indexing
BWT is often used for file compression (like bzip2), 
here used to make a fast ‘lookup’ index in a genome
BWT = ‘reversible block-sorting’
Forward BWT
Reverse BWT
This sequence is
more compressible
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
BWT is remarkable because it is
reversible.
Any ideas as how you might reverse it?
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Write the
sequence as
the last column
Sort it…
Add the
columns…
Sort those…
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Add the
columns…
Sort those…
Add the
columns…
Sort those…
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Add the
columns…
Sort those…
Add the
columns…
Sort those…
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
Add the
columns…
Add the
columns…
Sort those…
The row with
the "end of file"
character at the
end is the
original text
Burroughs-Wheeler transform indexing
http://en.wikipedia.org/wiki/Burrows-Wheeler_transform
The row with the "end of file"
character at the end is the
original text
Li & Durbin, doi:10.1093bioinformatics/btp324/
The Burroughs-Wheeler transform
leads naturally to a suffix array
http://blog.avadis-ngs.com/2012/04/elegant-exact-string-match-using-bwt-2/     (& wikipedia)
The Burroughs-Wheeler transform
leads naturally to a suffix array
http://blog.thegrandlocus.com/2016/07/a-tutorial-on-burrows-wheeler-indexing-methods
e
.
g
.
 
a
p
p
l
y
i
n
g
 
B
W
T
 
t
o
 
c
o
n
s
t
r
u
c
t
 
t
h
e
 
s
u
f
f
i
x
 
a
r
r
a
y
 
o
f
 
G
A
T
G
C
G
A
G
A
G
A
T
G
 
The search can be even more efficient by using compression & various other extensions
“If string W is a substring of X, the position of each occurrence of
W in X will occur in an interval in the suffix array. This is because
all the suffixes that have W as prefix are sorted together.”
Li & Durbin, doi:10.1093bioinformatics/btp324/
Why is this efficient?
Searching a suffix array in this way cuts the search
space in half at each step, so…
A suffix array of the human genome (3.2 billion bases)
takes at most
    
log
2
(3.2 billion) + 1 = 32 steps
 
to determine if a query sequence is present or not
There are few more steps to find all the occurrences, build an efficient
real-world implementation, use compression to reduce memory and
storage space, etc., but this still illustrates the massive savings in time
and memory from constructing an index
Burroughs
Wheeler
indexing
Trapnell C, Salzberg SL, 
Nat. Biotech.
, 2009
Convert each hit back
to genome location
Slide Note
Embed
Share

Explore the world of genome assembly, shotgun sequencing, and the fundamental concepts behind deriving genomes from large sets of sequencing reads. Learn about the complexity and core algorithms involved in contemporary genome assembly processes. Discover how the shotgun concept has revolutionized genomics and enabled researchers to unravel the mysteries hidden within the genetic code.

  • Genomics
  • Shotgun Sequencing
  • Genome Assembly
  • Bioinformatics
  • Genetic Data

Uploaded on Apr 02, 2024 | 2 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Assembling Genomes BCH394P/364C Systems Biology / Bioinformatics Edward Marcotte, Univ of Texas at Austin

  2. www.yourgenome.org/facts/timeline-the-human-genome-project Nature 409, 860-921(2001)

  3. Beijing Genomics Institute If it tastes good you should sequence it... you should know what's in the genes of that species Wang Jun, Chief executive, BGI (Wikipedia)

  4. The NovaSeq in the UT GSAF core generates >1.4 terabases of sequence in a 1-day run Illumina Many millions of 75-150 bp reads

  5. Thousands of 1,000 to 1,000,000 (ish) bp reads

  6. mapping shotgun sequencing

  7. (Translating the cloning jargon)

  8. Contemporary genome assembly is fairly complex, but at its core are assembly algorithms that grew from the shotgun concept Twelve quick steps for genome assembly and annotation in the classroom PLoS Comp Biology (2020), doi:10.1371/journal.pcbi.1008325

  9. Beverly Micro Pure White Hell Jigsaw Puzzle (10,000,000,000 Piece)

  10. Thinking about the basic shotgun concept Start with a very large set of random sequencing reads How might we match up the overlapping sequences? How can we assemble the overlapping reads together in order to derive the genome?

  11. Thinking about the basic shotgun concept At a high level, the first genomes were sequenced by comparing pairs of reads to find overlapping reads Then, building a graph (i.e., a network) to represent those relationships The genome sequence is a walk across that graph

  12. The Overlap-Layout-Consensus method Overlap: Compare all pairs of reads (allow some low level of mismatches) Layout: Construct a graph describing the overlaps sequence overlap read read Simplify the graph Find the simplest path through the graph Consensus: Reconcile errors among reads along that path to find the consensus sequence

  13. Building an overlap graph 3 5 EUGENE W. MYERS. Journal of Computational Biology. Summer 1995, 2(2): 275-290

  14. Building an overlap graph Reads 3 5 Overlap graph EUGENE W. MYERS. Journal of Computational Biology. Summer 1995, 2(2): 275-290 (more or less)

  15. Simplifying an overlap graph 1. Remove all contained nodes & edges going to them EUGENE W. MYERS. Journal of Computational Biology. Summer 1995, 2(2): 275-290 (more or less)

  16. Simplifying an overlap graph 2. Transitive edge removal: Given A B D and A D , remove A D EUGENE W. MYERS. Journal of Computational Biology. Summer 1995, 2(2): 275-290 (more or less)

  17. Simplifying an overlap graph 3. If un-branched, calculate consensus sequence If branched, assemble un-branched bits and then decide how they fit together EUGENE W. MYERS. Journal of Computational Biology. Summer 1995, 2(2): 275-290 (more or less)

  18. Simplifying an overlap graph contig (assembled contiguous sequence) EUGENE W. MYERS. Journal of Computational Biology. Summer 1995, 2(2): 275-290 (more or less)

  19. This basic strategy was used for most of the early genomes. Also useful: mate pairs 2 reads separated by a known distance Read #1 DNA fragment of known size Read #2 Contigs can be ordered using these paired reads Contig #1 Contig #2 to produce scaffolds

  20. GigAssembler (used to assemble the public human genome project sequence) Jim Kent David Haussler Let s take a little walk through history to see what they did

  21. Whole genome Assembly: big picture http://www.nature.com/scitable/content/anatomy-of-whole-genome-assembly-20429

  22. GigAssembler Preprocessing 1. Decontaminating & Repeat Masking. 2. Aligning of mRNAs, ESTs, BAC ends & paired reads against initial sequence contigs. psLayout BLAT 3. Creating an input directory (folder) structure.

  23. RepBase + RepeatMasker

  24. GigAssembler: Build merged sequence contigs ( rafts )

  25. Sequencing quality (Phred Score)

  26. Sequencing quality (Phred Score) Base-calling Error Probability http://en.wikipedia.org/wiki/Phred_quality_score

  27. Were going to skip the remaining details of GigAssembler (mainly of historical interest now) to get to the key strategy for assembling all of the various contigs and paired end reads into a genome

  28. GigAssembler: Build a raft-ordering graph

  29. GigAssembler: Build a raft-ordering graph Add information from mRNAs, ESTs, paired plasmid reads, BAC end pairs: building a bridge Different weight to different data type: (mRNA ~ highest) Conflicts with the graph as constructed so far are rejected. Build a sequence path through each raft. Fill the gap with N s. 100: between rafts 50,000: between bridged barges

  30. Finding the shortest path across the ordering graph using the Bellman-Ford algorithm http://compprog.wordpress.com/2007/11/29/one-source-shortest-path-the-bellman-ford-algorithm/

  31. Find the shortest path to all nodes. Take every edge and try to relax it (N 1 times where N is the count of nodes) +5 B C -2 +6 +8 -3 A +7 -4 +7 D E +2 +9

  32. Find the shortest path to all nodes. Take every edge and try to relax it (N 1 times where N is the count of nodes) +5 B C -2 +6 +8 -3 A +7 -4 +7 D E +2 +9

  33. Find the shortest path to all nodes. Take every edge and try to relax it (N 1 times where N is the count of nodes) +5 B C Inf. Inf. -2 +6 +8 -3 A +7 START -4 +7 D E +2 Inf. Inf. +9

  34. Find the shortest path to all nodes. Take every edge and try to relax it (N 1 times where N is the count of nodes) +5 B +6 ( A) C Inf. -2 +6 +8 -3 A 0 +7 START -4 +7 D +7 E +2 Inf. +9 ( A)

  35. Find the shortest path to all nodes. Take every edge and try to relax it (N 1 times where N is the count of nodes) +5 B +6 ( A) C +4 ( D) -2 +6 +8 -3 A 0 +7 START -4 +7 D +7 E +2 +2 +9 ( A) ( B)

  36. Find the shortest path to all nodes. Take every edge and try to relax it (N 1 times where N is the count of nodes) +5 B +2 ( C) C +4 ( D) -2 +6 +8 -3 A 0 +7 START -4 +7 D +7 E +2 +2 +9 ( A) ( B)

  37. Find the shortest path to all nodes. Take every edge and try to relax it (N 1 times where N is the count of nodes) +5 B +2 ( C) C +4 ( D) -2 +6 +8 -3 A 0 +7 START -4 +7 D +7 E -2 ( B) +2 +9 ( A)

  38. Answer: A-D-C-B-E +5 B +2 ( C) C +4 ( D) -2 +6 +8 -3 A 0 +7 START -4 +7 D +7 E -2 ( B) +2 +9 ( A)

  39. Modern assemblers now work a bit differently, using so-called DeBruijn graphs: Here s what we saw before: In Overlap-Layout-Consensus: Nodes are reads Edges are overlaps Nature Biotech 29(11):987-991 (2011)

  40. Modern assemblers now work a bit differently, using so-called DeBruijn graphs: In a DeBruijn graph: Nature Biotech 29(11):987-991 (2011)

  41. Why Eulerian? From Leonhard Euler s solution in 1735 to the Bridges of K nigsberg problem: K nigsberg (now Kaliningrad, Russia) had 7 bridges connecting 4 parts of the city. Could you visit each part of the city, walking across each bridge only once, & finish back where you started? Euler conceptualized it as a graph: Nodes = parts of city Edges = bridges (Visiting every edge once = an Eulerian path) Nature Biotech 29(11):987-991 (2011)

  42. DeBruijn graph assemblers tend to have nice properties, e.g. correcting sequencing errors & handling repeats better Sequencing errors appear as bulges Removing the bulges corrects the errors (e.g. leaves the red path) Nature Biotech 29(11):987-991 (2011)

  43. e.g. Velvet, an example algorithm using DeBruijn graphs Beginner s guide to comparative bacterial genome analysis using next-generation sequence data Microb Informatics Exp (2013) doi:10.1186/2042-5783-3-2

  44. Once a reference genome is assembled, new sequencing data can simply be mapped to the reference. reads Reference genome

  45. Mapping reads to assembled genomes The list is a little longer now! e.g. see https://en.wikipedia.org/wiki/ List_of_sequence_alignment_software#Short-Read_Sequence_Alignment Trapnell C, Salzberg SL, Nat. Biotech., 2009

  46. Mapping strategies Trapnell C, Salzberg SL, Nat. Biotech., 2009

  47. Burroughs Wheeler indexing Trapnell C, Salzberg SL, Nat. Biotech., 2009

  48. Burroughs-Wheeler transform indexing BWT is often used for file compression (like bzip2), here used to make a fast lookup index in a genome BWT = reversible block-sorting Input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES This sequence is more compressible Forward BWT Output TEXYDST.E.IXIXIXXSSMPPS.B..E.S.EUSFXDIIOIIIT Reverse BWT Recovered input SIX.MIXED.PIXIES.SIFT.SIXTY.PIXIE.DUST.BOXES http://en.wikipedia.org/wiki/Burrows-Wheeler_transform

  49. Burroughs-Wheeler transform indexing http://en.wikipedia.org/wiki/Burrows-Wheeler_transform

  50. Burroughs-Wheeler transform indexing http://en.wikipedia.org/wiki/Burrows-Wheeler_transform

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#