Advancements in Genome Alignment and Sequencing Techniques

Genome alignment

Usman Roshan

Applications

•

Genome sequencing on the rise

•

Whole genome comparison provides a deeper

understanding of biology

–

Evolutionary history

–

Non-coding regions

–

Variant detection

Methods

•

General two-fold approach

•

1. Find high scoring segments between pair of

genomes.

–

Similar to BLAST like k-mer search using hash-

tables

–

Also done with suffix tree

–

Similar to short read mapping strategies

•

2. Perform constrained alignment between

high scoring segments

Longest increasing subsequence

•

Simple algorithm takes O(n

) time where n is

the input size (total numbers in sequence)

•

Can be solved in O(nlog(n)) time by creating

extra data structures and remembering where

the previous longest subsequence ended

Simple genome alignment

•

Find high scoring segments with hash tables

•

Line up high scoring segments and find

longest increasing subsequence (like in

MUMmer)

•

Align between the segments

•

Output full genome alignment

Programs and experimental

comparison

•

Alignathon

: assessment of whole genome

sequence alignment programs on simulated

data

•

Several genome alignment programs were

used

•

Since the data is simulated we know the true

alignment and this can calculate the accuracy

•

See paper on website for overview

Exact genome alignment

•

Alignment of divergence sequences is

challenging

•

How would an exact alignment method fare in

comparison to traditional methods?

•

We compare a parallel GPU approach here to

two popular methods.

Fragment of

genome X

Genome Y

Experimental setup

Simulated data: We use simulated pairwise alignments of

divergent species from the Alignathon study (Earl et. al.

2014)

Methods: LASTZ (Harris 2007), PECAN (Paten

et. al.

2008), and GPU-EXACT (previous slide)

All methods were applied without pre and post processing

and with default parameters

Accuracy:

Results on simulated data

Method

Average F-score

       Average

across 26 datasets             runtime (mins)

LASTZ

0.1013                                44

PECAN

 0.1329                               253

GPU-EXACT

0.2413                                4047

simCow.chrA        0.31                0.32             0.63

simDog.chr A

simDog.chrA         0.02               0.03             0.23

simRat.chrQ

simCow.chrB        0.26                0.33             0.54

simHuman.chrF

Results on real data

We create a reference MAF file of MAFFT aligned genes

between Haemophilus influenza and E.coli K12. We then

measure the precision only since there are no negatives.

Method

             Precision

LASTZ                               0.352

PECAN                              0.247

Our approach                    0.438

Slide Note

Embed Share

Download

Genome alignment plays a crucial role in understanding biological processes and evolutionary history. With the rise of whole genome sequencing, methods such as constrained alignment and longest increasing subsequence have been employed for accurate variant detection. Tools like BLAST, hash tables, and GPU-based approaches are used to align genomic sequences efficiently, leading to enhanced accuracy and deeper biological insights.

wnuk_mya Follow

Uploaded on Sep 07, 2024 | 0 Views

Download Presentation

Please find below an Image/Link to download the presentation.

The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.

E N D

Presentation Transcript

Genome alignment Usman Roshan

Applications Genome sequencing on the rise Whole genome comparison provides a deeper understanding of biology Evolutionary history Non-coding regions Variant detection

Methods General two-fold approach 1. Find high scoring segments between pair of genomes. Similar to BLAST like k-mer search using hash- tables Also done with suffix tree Similar to short read mapping strategies 2. Perform constrained alignment between high scoring segments

Longest increasing subsequence Simple algorithm takes O(n2) time where n is the input size (total numbers in sequence) Can be solved in O(nlog(n)) time by creating extra data structures and remembering where the previous longest subsequence ended

Simple genome alignment Find high scoring segments with hash tables Line up high scoring segments and find longest increasing subsequence (like in MUMmer) Align between the segments Output full genome alignment

Programs and experimental comparison Alignathon: assessment of whole genome sequence alignment programs on simulated data Several genome alignment programs were used Since the data is simulated we know the true alignment and this can calculate the accuracy See paper on website for overview

Exact genome alignment Alignment of divergence sequences is challenging How would an exact alignment method fare in comparison to traditional methods? We compare a parallel GPU approach here to two popular methods.

Input: Two whole genome sequences X and Y. Algorithm: We split genome X into short fragments of the same length and align to genome Y with the MaxSSMap program (Turki and Roshan, 2014) that uses the maximum scoring subsequence and has a fast GPU implementation. High scoring fragments constitute anchors from which a final alignment can be built. In some cases the anchors themselves serve as the genome alignment output. Fragment of genome X Genome Y Break to same length Fragment 1 Fragment 2 Fragment 3 Fragment 4 Fragment 5 Fragment 0 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5

Experimental setup Simulated data: We use simulated pairwise alignments of divergent species from the Alignathon study (Earl et. al. 2014) Methods: LASTZ (Harris 2007), PECAN (Paten et. al. 2008), and GPU-EXACT (previous slide) All methods were applied without pre and post processing and with default parameters True Positive + Accuracy: = precision precision recall True Positive False Positive = 2 Fscore + precision recall True Positive + = recall True Positive False Negative

Results on simulated data Method LASTZ PECAN GPU-EXACT Average F-score across 26 datasets runtime (mins) 0.1013 44 0.1329 253 0.2413 4047 Average F-scores on selected datasets: Data LASTZ PECAN GPU-EXACT simCow.chrA 0.31 0.32 0.63 simDog.chr A simDog.chrA 0.02 0.03 0.23 simRat.chrQ simCow.chrB 0.26 0.33 0.54 simHuman.chrF

Results on real data We create a reference MAF file of MAFFT aligned genes between Haemophilus influenza and E.coli K12. We then measure the precision only since there are no negatives. Method LASTZ 0.352 PECAN 0.247 Our approach 0.438 Precision

Advancements in Genome Alignment and Sequencing Techniques

Download Presentation

Presentation Transcript

Related

More Related Content