Advancements in Genome Alignment and Sequencing Techniques

 
Genome alignment
 
Usman Roshan
 
Applications
 
Genome sequencing on the rise
Whole genome comparison provides a deeper
understanding of biology
Evolutionary history
Non-coding regions
Variant detection
 
Methods
 
General two-fold approach
1. Find high scoring segments between pair of
genomes.
Similar to BLAST like k-mer search using hash-
tables
Also done with suffix tree
Similar to short read mapping strategies
2. Perform constrained alignment between
high scoring segments
 
Longest increasing subsequence
 
Simple algorithm takes O(n
2
) time where n is
the input size (total numbers in sequence)
Can be solved in O(nlog(n)) time by creating
extra data structures and remembering where
the previous longest subsequence ended
 
Simple genome alignment
 
Find high scoring segments with hash tables
Line up high scoring segments and find
longest increasing subsequence (like in
MUMmer)
Align between the segments
Output full genome alignment
 
Programs and experimental
comparison
 
Alignathon
: assessment of whole genome
sequence alignment programs on simulated
data
Several genome alignment programs were
used
Since the data is simulated we know the true
alignment and this can calculate the accuracy
See paper on website for overview
 
Exact genome alignment
 
Alignment of divergence sequences is
challenging
How would an exact alignment method fare in
comparison to traditional methods?
We compare a parallel GPU approach here to
two popular methods.
 
Fragment of
genome X
 
Genome Y
 
Experimental setup
 
Simulated data: We use simulated pairwise alignments of
divergent species from the Alignathon study (Earl et. al.
2014)
 
Methods: LASTZ (Harris 2007), PECAN (Paten 
et. al.
2008), and GPU-EXACT (previous slide)
 
All methods were applied without pre and post processing
and with default parameters
 
Accuracy:
 
Results on simulated data
 
Method
   
Average F-score
  
       Average
    
across 26 datasets             runtime (mins)
LASTZ                 
 
0.1013                                44
PECAN               
 
 0.1329                               253
GPU-EXACT
  
0.2413                                4047
 
F
-
s
c
o
r
e
s
 
o
n
 
s
e
l
e
c
t
e
d
 
d
a
t
a
s
e
t
s
:
D
a
t
a
L
A
S
T
Z
 
 
 
 
 
 
 
 
 
 
 
P
E
C
A
N
 
 
 
 
 
 
 
 
G
P
U
-
E
X
A
C
T
simCow.chrA        0.31                0.32             0.63
simDog.chr A
 
simDog.chrA         0.02               0.03             0.23
simRat.chrQ
 
simCow.chrB        0.26                0.33             0.54
simHuman.chrF
 
Results on real data
 
We create a reference MAF file of MAFFT aligned genes
between Haemophilus influenza and E.coli K12. We then
measure the precision only since there are no negatives.
 
Method
   
             Precision
LASTZ                               0.352
PECAN                              0.247
Our approach                    0.438
Slide Note
Embed
Share

Genome alignment plays a crucial role in understanding biological processes and evolutionary history. With the rise of whole genome sequencing, methods such as constrained alignment and longest increasing subsequence have been employed for accurate variant detection. Tools like BLAST, hash tables, and GPU-based approaches are used to align genomic sequences efficiently, leading to enhanced accuracy and deeper biological insights.

  • Genome Alignment
  • Sequencing Techniques
  • Evolutionary History
  • Variant Detection
  • BLAST

Uploaded on Sep 07, 2024 | 0 Views


Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript


  1. Genome alignment Usman Roshan

  2. Applications Genome sequencing on the rise Whole genome comparison provides a deeper understanding of biology Evolutionary history Non-coding regions Variant detection

  3. Methods General two-fold approach 1. Find high scoring segments between pair of genomes. Similar to BLAST like k-mer search using hash- tables Also done with suffix tree Similar to short read mapping strategies 2. Perform constrained alignment between high scoring segments

  4. Longest increasing subsequence Simple algorithm takes O(n2) time where n is the input size (total numbers in sequence) Can be solved in O(nlog(n)) time by creating extra data structures and remembering where the previous longest subsequence ended

  5. Simple genome alignment Find high scoring segments with hash tables Line up high scoring segments and find longest increasing subsequence (like in MUMmer) Align between the segments Output full genome alignment

  6. Programs and experimental comparison Alignathon: assessment of whole genome sequence alignment programs on simulated data Several genome alignment programs were used Since the data is simulated we know the true alignment and this can calculate the accuracy See paper on website for overview

  7. Exact genome alignment Alignment of divergence sequences is challenging How would an exact alignment method fare in comparison to traditional methods? We compare a parallel GPU approach here to two popular methods.

  8. Input: Two whole genome sequences X and Y. Algorithm: We split genome X into short fragments of the same length and align to genome Y with the MaxSSMap program (Turki and Roshan, 2014) that uses the maximum scoring subsequence and has a fast GPU implementation. High scoring fragments constitute anchors from which a final alignment can be built. In some cases the anchors themselves serve as the genome alignment output. Fragment of genome X Genome Y Break to same length Fragment 1 Fragment 2 Fragment 3 Fragment 4 Fragment 5 Fragment 0 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5

  9. Experimental setup Simulated data: We use simulated pairwise alignments of divergent species from the Alignathon study (Earl et. al. 2014) Methods: LASTZ (Harris 2007), PECAN (Paten et. al. 2008), and GPU-EXACT (previous slide) All methods were applied without pre and post processing and with default parameters True Positive + Accuracy: = precision precision recall True Positive False Positive = 2 Fscore + precision recall True Positive + = recall True Positive False Negative

  10. Results on simulated data Method LASTZ PECAN GPU-EXACT Average F-score across 26 datasets runtime (mins) 0.1013 44 0.1329 253 0.2413 4047 Average F-scores on selected datasets: Data LASTZ PECAN GPU-EXACT simCow.chrA 0.31 0.32 0.63 simDog.chr A simDog.chrA 0.02 0.03 0.23 simRat.chrQ simCow.chrB 0.26 0.33 0.54 simHuman.chrF

  11. Results on real data We create a reference MAF file of MAFFT aligned genes between Haemophilus influenza and E.coli K12. We then measure the precision only since there are no negatives. Method LASTZ 0.352 PECAN 0.247 Our approach 0.438 Precision

More Related Content

giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#giItT1WQy@!-/#