Advancements in Genome Alignment and Sequencing Techniques
Genome alignment plays a crucial role in understanding biological processes and evolutionary history. With the rise of whole genome sequencing, methods such as constrained alignment and longest increasing subsequence have been employed for accurate variant detection. Tools like BLAST, hash tables, and GPU-based approaches are used to align genomic sequences efficiently, leading to enhanced accuracy and deeper biological insights.
Download Presentation
Please find below an Image/Link to download the presentation.
The content on the website is provided AS IS for your information and personal use only. It may not be sold, licensed, or shared on other websites without obtaining consent from the author. Download presentation by click this link. If you encounter any issues during the download, it is possible that the publisher has removed the file from their server.
E N D
Presentation Transcript
Genome alignment Usman Roshan
Applications Genome sequencing on the rise Whole genome comparison provides a deeper understanding of biology Evolutionary history Non-coding regions Variant detection
Methods General two-fold approach 1. Find high scoring segments between pair of genomes. Similar to BLAST like k-mer search using hash- tables Also done with suffix tree Similar to short read mapping strategies 2. Perform constrained alignment between high scoring segments
Longest increasing subsequence Simple algorithm takes O(n2) time where n is the input size (total numbers in sequence) Can be solved in O(nlog(n)) time by creating extra data structures and remembering where the previous longest subsequence ended
Simple genome alignment Find high scoring segments with hash tables Line up high scoring segments and find longest increasing subsequence (like in MUMmer) Align between the segments Output full genome alignment
Programs and experimental comparison Alignathon: assessment of whole genome sequence alignment programs on simulated data Several genome alignment programs were used Since the data is simulated we know the true alignment and this can calculate the accuracy See paper on website for overview
Exact genome alignment Alignment of divergence sequences is challenging How would an exact alignment method fare in comparison to traditional methods? We compare a parallel GPU approach here to two popular methods.
Input: Two whole genome sequences X and Y. Algorithm: We split genome X into short fragments of the same length and align to genome Y with the MaxSSMap program (Turki and Roshan, 2014) that uses the maximum scoring subsequence and has a fast GPU implementation. High scoring fragments constitute anchors from which a final alignment can be built. In some cases the anchors themselves serve as the genome alignment output. Fragment of genome X Genome Y Break to same length Fragment 1 Fragment 2 Fragment 3 Fragment 4 Fragment 5 Fragment 0 Thread 0 Thread 1 Thread 2 Thread 3 Thread 4 Thread 5
Experimental setup Simulated data: We use simulated pairwise alignments of divergent species from the Alignathon study (Earl et. al. 2014) Methods: LASTZ (Harris 2007), PECAN (Paten et. al. 2008), and GPU-EXACT (previous slide) All methods were applied without pre and post processing and with default parameters True Positive + Accuracy: = precision precision recall True Positive False Positive = 2 Fscore + precision recall True Positive + = recall True Positive False Negative
Results on simulated data Method LASTZ PECAN GPU-EXACT Average F-score across 26 datasets runtime (mins) 0.1013 44 0.1329 253 0.2413 4047 Average F-scores on selected datasets: Data LASTZ PECAN GPU-EXACT simCow.chrA 0.31 0.32 0.63 simDog.chr A simDog.chrA 0.02 0.03 0.23 simRat.chrQ simCow.chrB 0.26 0.33 0.54 simHuman.chrF
Results on real data We create a reference MAF file of MAFFT aligned genes between Haemophilus influenza and E.coli K12. We then measure the precision only since there are no negatives. Method LASTZ 0.352 PECAN 0.247 Our approach 0.438 Precision