GPress: A Framework for Querying Genome Annotation Files in Compressed Form

Qingxi Meng

Advisor: Prof. Idoia Ochoa

15th CSL Student Conference

Table of Contents

1. Motivations
2. Method Overview
   • Compression
   • Random Access
3. Extension to other files: expression files
4. Conclusion

1. Motivations

Many large genomic projects use GFF files

• 5000 Insect Genome Initiative (i5k)
• Plant Genome Initiative
• Genome 10K (G10K)

Challenges

• Storage of GFF files takes more and more space
• GFF files are frequently revised, annotated, queried and streamed

Current GFF utilities

• tabix: uses a general-purpose compressor
• gffutils (Python): does not compress the data
• gffread: works directly on the original file

Our framework: GPress

◦ Saves space: compression of GFF files
◦ Supports quick searches: random access

2. Method Overview

Each GFF file consists of 9 columns of data

seqname    source    feature    start    end      score    strand    frame    attribute
chr1       HAVANA    gene       11869    14409    .        +         .        gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

Idea: divide the columns into different data streams
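The column-splitting idea can be illustrated with a minimal sketch. The slides do not say which compressor GPress applies to each stream, so `zlib` is used here purely as a stand-in, and the function names (`split_into_streams`, `compress_streams`, `restore_lines`) are hypothetical and not part of GPress.

```python
import zlib

# Column names as listed on the slide, in GFF order.
GFF_COLUMNS = ["seqname", "source", "feature", "start", "end",
               "score", "strand", "frame", "attribute"]


def split_into_streams(gff_text):
    """Return one newline-joined string per GFF column."""
    streams = {name: [] for name in GFF_COLUMNS}
    for line in gff_text.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split("\t", 8)
        if len(fields) != 9:
            raise ValueError("expected 9 tab-separated columns: %r" % line)
        for name, value in zip(GFF_COLUMNS, fields):
            streams[name].append(value)
    return {name: "\n".join(values) for name, values in streams.items()}


def compress_streams(streams):
    """Compress every column stream on its own (zlib is an assumption)."""
    return {name: zlib.compress(data.encode("utf-8"), 9)
            for name, data in streams.items()}


def restore_lines(compressed):
    """Decompress all streams and zip the columns back into GFF lines."""
    columns = [zlib.decompress(compressed[name]).decode("utf-8").split("\n")
               for name in GFF_COLUMNS]
    return ["\t".join(row) for row in zip(*columns)]
```

Grouping values of the same type (for example, all start coordinates) tends to make each stream more repetitive and therefore easier to compress than the interleaved rows.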

Block Diagram for GPress Encoder

Results

• On average, a 98% reduction in the size of the original file

But we still need to decompress the whole file for searches

Idea: compress in blocks
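A minimal sketch of block-wise compression follows. The block size, the use of `zlib`, and the function names are assumptions made for illustration; the slides do not describe how GPress lays out its blocks.

```python
import zlib


def compress_in_blocks(lines, block_size=2):
    """Compress consecutive groups of `block_size` lines independently."""
    blocks = []
    for first in range(0, len(lines), block_size):
        chunk = "\n".join(lines[first:first + block_size])
        blocks.append(zlib.compress(chunk.encode("utf-8")))
    return blocks


def read_line(blocks, line_no, block_size=2):
    """Fetch one line by decompressing only the block that contains it."""
    block_no, offset = divmod(line_no, block_size)
    chunk = zlib.decompress(blocks[block_no]).decode("utf-8")
    return chunk.split("\n")[offset]
```

Because each block is compressed on its own, a lookup touches one block instead of the whole file. Smaller blocks give faster lookups but usually a worse compression ratio.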

Random Access

Results

• ID search: given an ID (e.g. ENSE00003486434.1), GPress prints the item's information in around 2.5 seconds.
• Range search: given a range of coordinates (e.g. 10000 to 100000), GPress prints all items in the range in around 4.5 seconds.
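The two query types can be sketched on top of independently compressed blocks. Everything below is a hypothetical illustration rather than the GPress implementation: the index layout, the function names, the use of `zlib`, and the assumptions that all items lie on one sequence and that "in the range" means fully contained in it.

```python
import zlib


def build_index(items, block_size=2):
    """items: list of (item_id, gff_line) pairs in file order."""
    blocks, id_index, block_spans = [], {}, []
    for first in range(0, len(items), block_size):
        chunk = items[first:first + block_size]
        starts, ends = [], []
        for offset, (item_id, line) in enumerate(chunk):
            fields = line.split("\t")
            starts.append(int(fields[3]))  # column 4: start
            ends.append(int(fields[4]))    # column 5: end
            id_index[item_id] = (len(blocks), offset)
        block_spans.append((min(starts), max(ends)))
        text = "\n".join(line for _, line in chunk)
        blocks.append(zlib.compress(text.encode("utf-8")))
    return blocks, id_index, block_spans


def _read_block(blocks, block_no):
    return zlib.decompress(blocks[block_no]).decode("utf-8").split("\n")


def search_by_id(index, item_id):
    """Decompress only the block that holds the requested ID."""
    blocks, id_index, _ = index
    block_no, offset = id_index[item_id]
    return _read_block(blocks, block_no)[offset]


def search_by_range(index, lo, hi):
    """Return lines whose [start, end] lies inside [lo, hi]."""
    blocks, _, block_spans = index
    hits = []
    for block_no, (block_lo, block_hi) in enumerate(block_spans):
        if block_hi < lo or block_lo > hi:
            continue  # block cannot contain a match, so skip decompression
        for line in _read_block(blocks, block_no):
            fields = line.split("\t")
            if lo <= int(fields[3]) and int(fields[4]) <= hi:
                hits.append(line)
    return hits
```

The per-block coordinate span lets a range query skip blocks that cannot contain a match, which is what keeps the search from decompressing the whole file.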

3. Extension to other files

Expression Files

ENST00000567887.5|ENSG00000004059.10|OTTHUMG00000023246.5|OTTHUMT00000059567.2|ARF5-001|ARF5|1103|protein_coding|    01c9c486-321f-4ebc-ade7-bbe6ea5c4a6e    5060.01331481811    142.006325562341    892.1653    1103

Use the same idea as for GFF files
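As a sketch of how the same column-splitting idea carries over, the record above can be parsed into the six columns named in the transcript of this talk (target ID, sample, EST counts, tpm value, effective length, length). The Python field names and both functions are hypothetical and chosen only for illustration.

```python
# Field names are illustrative labels for the six columns of the slide.
EXPRESSION_COLUMNS = ["target_id", "sample", "est_counts",
                      "tpm", "eff_length", "length"]


def parse_expression_line(line):
    """Split one whitespace-separated expression record into named fields."""
    fields = line.split()
    if len(fields) != 6:
        raise ValueError("expected 6 columns, got %d" % len(fields))
    record = dict(zip(EXPRESSION_COLUMNS, fields))
    # The target ID is itself a '|'-separated list of identifiers.
    record["target_subfields"] = [
        part for part in record["target_id"].split("|") if part
    ]
    return record


def expression_streams(lines):
    """Group the values of each column into its own stream (a list)."""
    records = [parse_expression_line(line) for line in lines]
    return {name: [rec[name] for rec in records]
            for name in EXPRESSION_COLUMNS}
```

Each resulting stream could then be compressed separately and in blocks, in the same way as the GFF columns.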

Results

• Reduces expression file size by more than 68% compared to gzip.
• Retrieves the information within seconds.

4. Conclusion

• On average, a 98% reduction in the size of the original GFF file
• Compresses around twice as well as gzip
• Supports queries in seconds
• Can support expression files

Future Work: other annotation files with similar structure

• WIG files
• VCF files
• BED files
• …

Thanks!

Any questions?

Genome projects generate large GFF files which require significant storage space. GPress offers a solution by compressing GFF files while allowing quick searches and random access. The framework addresses challenges faced by current GFF utilities, providing a more efficient approach to managing and querying genomic data.

Uploaded on Sep 24, 2024

Presentation Transcript

GPress: a framework for querying genome annotation files in a compressed form. Qingxi Meng. Advisor: Prof. Idoia Ochoa. 15th CSL Student Conference.

Table of Contents: 1. Motivations; 2. Method Overview (Compression, Random Access); 3. Extension to other files: expression files; 4. Conclusion

1. Motivations

Many large genomic projects use GFF files: 5000 Insect Genome Initiative (i5k), Plant Genome Initiative, Genome 10K (G10K).

Challenges: storage of GFF files takes more and more space; GFF files are frequently revised, annotated, queried and streamed.

Current GFF utilities: tabix uses a general-purpose compressor; gffutils (Python) does not compress the data; gffread works directly on the original file.

Our framework, GPress: saves space (compression of GFF files) and supports quick searches (random access).

2. Method Overview

seqname source feature start end score strand frame attribute
chr1 HAVANA gene 11869 14409 . + . gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";

Each GFF file consists of 9 columns of data:
Column 1: Seqname (string)
Column 2: Source (string)
Column 3: Feature (string)
Column 4: Start (integer)
Column 5: End (integer)
Column 6: Score (floating point)
Column 7: Strand (char)
Column 8: Frame (integer)
Column 9: Attribute (string)

Idea: divide the columns into different data streams.

Block Diagram for GPress Encoder

Results: on average, a 98% reduction in the size of the original file. But we still need to decompress the whole file for searches. Idea: compress in blocks.

Random Access

Results
ID search: given an ID (e.g. ENSE00003486434.1), GPress prints the item's information in around 2.5 seconds.
Range search: given a range of coordinates (e.g. 10000 to 100000), GPress prints all items in the range in around 4.5 seconds.

3. Extension to other files

Expression Files
ENST00000567887.5|ENSG00000004059.10|OTTHUMG00000023246.5|OTTHUMT00000059567.2|ARF5-001|ARF5|1103|protein_coding| 01c9c486-321f-4ebc-ade7-bbe6ea5c4a6e 5060.01331481811 142.006325562341 892.1653 1103
Column 1: target ID
Column 2: sample
Column 3: EST counts
Column 4: tpm value
Column 5: effective length
Column 6: length