DNA Data Archival: Solving Read Consensus Using OneJoin Algorithm

DNA is a dense, durable medium for archiving cold digital data, but reading the data back produces very large numbers of noisy reads that must be grouped and reduced to a consensus. This presentation describes OneJoin, a scalable, cross-architecture edit similarity join used to solve DNA read consensus: noisy reads are joined by edit similarity, clustered, and collapsed into consensus strings. OneJoin is implemented with oneAPI in DPC++. Its key steps (embedding, bucketization for Hamming LSH, and candidate generation) run as data-parallel kernels, and the same code runs on both CPUs and GPUs.


Uploaded on Sep 12, 2024


Presentation Transcript


  1. OneOligo: oneAPI for DNA data storage
     Eugenio Marinelli
     Supervisor: Raja Appuswamy
     EURECOM

  2. DNA: A biological medium for archiving digital data
     - 80% of enterprise data is cold, and it is growing at 60% CAGR [Horison]
     - 60% of archival data is stored for longer than 20 years [SNIA]
     - Dense and durable [SemiSynBio Roadmap, 2018] [The Guardian, 2017]
     12/09/2024 - p 1

  3. DNA Data Archival: Overview

  4. Solving DNA Read Consensus
     Pipeline: Noisy Reads -> Edit Similarity Join -> Clustering -> Consensus
     [Figure: example noisy reads such as ACTGATGTGATGCGTA, ACTGATGTGATGCGCTA, ACTGATGTGATGCGTCA, ATCGTGCATAGTCAGT, and ATCGTGCATAGTCTGT are joined by edit similarity, grouped into clusters, and reduced to one consensus string per cluster.]
     - N = O(100M to 1B) strings
     - O(N^2) comparison operations
     - O(k^2) per edit distance computation (k = string length)
     - O(100k to 1M) strings
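The cost figures on this slide can be made concrete with a small sketch. The code below is an illustrative baseline, not the presenters' implementation: `edit_distance` is the textbook dynamic program that fills a table of roughly k x k cells, and `naive_similarity_join` (a hypothetical name) compares every pair of reads, which is the O(N^2) behaviour OneJoin is designed to avoid. The threshold of 2 is an assumption chosen for the example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via a two-row dynamic program: O(len(a) * len(b))."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def naive_similarity_join(reads, threshold):
    """Compare every pair of reads: O(N^2) edit distance computations."""
    pairs = []
    for i in range(len(reads)):
        for j in range(i + 1, len(reads)):
            if edit_distance(reads[i], reads[j]) <= threshold:
                pairs.append((i, j))
    return pairs


# Four reads taken from the slide's example figure.
reads = [
    "ACTGATGTGATGCGTA",
    "ACTGATGTGATGCGCTA",
    "ATCGTGCATAGTCAGT",
    "ATCGTGCATAGTCTGT",
]
print(naive_similarity_join(reads, threshold=2))  # prints [(0, 1), (2, 3)]
```

With N in the hundreds of millions, the nested loop alone is on the order of 10^16 distance computations, which is why the join needs a filtering step before verification.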

  5. OneJoin (Algorithm): Scalable, Cross-Architecture Edit Similarity Join with oneAPI
     [Figure: five example reads (1. ACTGATGTGATCC - TA, 2. ATCGTGCATAGTCAGT, 3. ACTGATGTGATGCGTA, 4. ATCGTGCATAGTCTGT, 5. ACTGATGTGATGCGTCA) are embedded, hashed into Bucket 1 ... Bucket N, and verified. Reads 1, 3, and 5 end up paired with one another, as do reads 2 and 4.]
     - Low-distortion Embedding: converts strings s1, s2 in S into s'1, s'2 in S' such that, with Pr >= 0.999, Ham-dist(s'1, s'2) < O(Ed-dist(s1, s2)^2).
     - Hamming LSH: hashes the embedded strings into buckets using a family of hash functions such that two strings within the same bucket are similar with very high probability.
     - Verification & Pair Generation: computes the edit distance within each bucket to find all possible pairs.
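The three stages on this slide can be sketched end to end. This is a minimal illustration under stated assumptions, not the OneJoin code: the slide does not name the embedding, so the sketch assumes a random-walk embedding in the style of Chakraborty, Goldenberg, and Koucký (CGK), and it assumes bit-sampling hash functions for the Hamming LSH step. The names `make_walk_bits`, `embed`, `lsh_join`, and the parameters `num_tables`, `sample_size`, and `seed` are all hypothetical.

```python
import random
from collections import defaultdict
from itertools import combinations

ALPHABET = "ACGT"
PAD = "N"  # filler symbol once the input string is exhausted


def edit_distance(a, b):
    """Levenshtein distance, used only for the final verification step."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def make_walk_bits(out_len, rng):
    """One random bit per (output position, symbol), shared by every string."""
    return [{c: rng.randint(0, 1) for c in ALPHABET} for _ in range(out_len)]


def embed(s, walk_bits):
    """Assumed CGK-style embedding: copy s[i], then advance i only if the bit is 1."""
    out, i = [], 0
    for bits in walk_bits:
        if i < len(s):
            out.append(s[i])
            i += bits[s[i]]
        else:
            out.append(PAD)
    return "".join(out)


def lsh_join(reads, threshold, num_tables=8, sample_size=6, seed=0):
    rng = random.Random(seed)
    out_len = 3 * max(len(r) for r in reads)
    walk_bits = make_walk_bits(out_len, rng)
    embedded = [embed(r, walk_bits) for r in reads]

    candidates = set()
    for _ in range(num_tables):
        # Bit-sampling hash: the key is the embedded string at a few positions.
        positions = rng.sample(range(out_len), sample_size)
        buckets = defaultdict(list)
        for idx, e in enumerate(embedded):
            buckets["".join(e[p] for p in positions)].append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))

    # Verification: keep only candidates whose true edit distance is small.
    return sorted((i, j) for i, j in candidates
                  if edit_distance(reads[i], reads[j]) <= threshold)


reads = ["ACTGATGTGATGCGTA", "ATCGTGCATAGTCAGT", "ACTGATGTGATGCGTA",
         "ATCGTGCATAGTCTGT", "ACTGATGTGATGCGTCA"]
print(lsh_join(reads, threshold=2))
```

Because verification recomputes the exact edit distance, the filter can miss pairs but never reports a false one; which near-duplicate pairs are found depends on the random seed and on how many hash tables are used. Identical reads always embed to identical strings, so they share a bucket in every table.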

  6. OneJoin (Implementation): Scalable, Cross-Architecture Edit Similarity Join with oneAPI
     - Implemented in DPC++
     - Scalable
       - Key steps implemented as data-parallel DPC++ kernels: embedding, bucketization for Hamming LSH, candidate generation
       - Control logic on the host exploiting oneTBB: data structure reorganization at join stages (sort, dedup, ...)
     - Cross-architecture
       - Across CPU & GPU: fork-join execution model
       - Cross-vendor using the CodePlay LLVM backend
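The split between parallel kernels and host-side reorganization can be illustrated with a small stand-in. This sketch is plain Python, not DPC++ or oneTBB: a `ThreadPoolExecutor` plays the role of the fork-join step, `pairs_in_bucket` (a hypothetical name) plays the role of a candidate-generation kernel, and the sort followed by adjacent-duplicate removal mirrors the "sort, dedup" reorganization the slide mentions. The bucket contents are taken from the pairings shown in the algorithm slide's figure, with one bucket repeated to create a duplicate.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def pairs_in_bucket(bucket):
    """Stand-in 'kernel': emit every candidate pair (i, j) with i < j in one bucket."""
    return list(combinations(sorted(bucket), 2))


def candidate_pairs(buckets, workers=4):
    # Fork: each bucket is processed independently.
    # Join: pool.map returns only after every bucket has been handled.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial = list(pool.map(pairs_in_bucket, buckets))

    # Host-side reorganization: flatten, sort, then drop adjacent duplicates,
    # since the same pair can be produced by more than one bucket.
    flat = sorted(p for part in partial for p in part)
    return [p for k, p in enumerate(flat) if k == 0 or p != flat[k - 1]]


buckets = [[1, 3], [1, 5], [3, 5], [2, 4], [3, 1]]
print(candidate_pairs(buckets))  # prints [(1, 3), (1, 5), (2, 4), (3, 5)]
```

Sorting before deduplication means duplicates become neighbours, so a single linear pass removes them; this is the usual reason the two operations appear together.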

  7. OneJoin & DNA Reads
     - Dataset: 5M, 10M, and 20M reads
     - Hardware:
       - 12-core Intel Core i9-10920X CPU
       - NVIDIA RTX 2080 (compiled with the CodePlay LLVM backend)

  8. Conclusion
     - OneJoin scales well due to DPC++ & oneAPI
       - DPC++ provides portable cross-architecture parallelism
     - DNA read consensus scales well due to OneJoin
       - Fundamental solution applicable to other problems (sequence assembly/alignment, read clustering, ...)
     - Code and more details: https://devmesh.intel.com/projects/oneoligo-860c37
     - A shout-out to devcloud and the oneAPI developer forum
       - Thanks to devcloud for enabling rapid prototyping with diverse hardware
       - Thanks to forum admins for their timely support
