GPress: A Framework for Querying Genome Annotation Files in Compressed Form

 
 
Qingxi Meng
Advisor: Prof. Idoia Ochoa
15th CSL Student Conference
 
Table of Contents
 
1. Motivations
2. Method Overview
   Compression
   Random Access
3. Extension to other files: expression files
4. Conclusion
 
1. Motivations
 
Many large genomic projects use GFF files

5000 Insect Genome Initiative (i5k)
Plant Genome Initiative
Genome 10K (G10K)
 
Challenges

Storage of GFF files takes more and more space
GFF files are frequently revised, annotated, queried and streamed
 
Current GFF utilities

tabix: uses a general-purpose compressor
gffutils (Python): does not compress the data
gffread: works directly on the original file
 
Our framework: GPress

Saves space: compression of GFF files
Supports quick searches: random access
 
2. Method Overview
 
 
seqname    source    feature    start    end    score    strand    frame    attribute

chr1    HAVANA    gene    11869    14409    .    +    .    gene_id "ENSG00000223972.5"; gene_type "transcribed_unprocessed_pseudogene"; gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";
 
Each GFF file consists of 9 columns of data

Column 1: Seqname (string)
Column 2: Source (string)
Column 3: Feature (string)
Column 4: Start (integer)
Column 5: End (integer)
Column 6: Score (floating point)
Column 7: Strand (char)
Column 8: Frame (integer)
Column 9: Attribute (string)
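As an illustration of the nine-column layout, here is a minimal Python sketch that parses one record into typed fields. The names `GffRecord`, `parse_attributes` and `parse_gff_line` are hypothetical and not part of GPress. The sketch assumes tab-separated columns, a "." for a missing score or frame, and the `key "value";` attribute style seen in the example record.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class GffRecord:
    """One GFF line, typed according to the nine columns (hypothetical helper)."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: Optional[float]   # None when the file holds "."
    strand: str
    frame: Optional[int]     # None when the file holds "."
    attributes: Dict[str, str]


def parse_attributes(text: str) -> Dict[str, str]:
    """Parse 'key "value"; key value;' pairs (assumed attribute style)."""
    attributes = {}
    for part in text.split(";"):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition(" ")
        attributes[key] = value.strip().strip('"')
    return attributes


def parse_gff_line(line: str) -> GffRecord:
    """Split a tab-separated line into nine typed fields."""
    f = line.rstrip("\n").split("\t")
    if len(f) != 9:
        raise ValueError(f"expected 9 columns, got {len(f)}")
    return GffRecord(
        seqname=f[0],
        source=f[1],
        feature=f[2],
        start=int(f[3]),
        end=int(f[4]),
        score=None if f[5] == "." else float(f[5]),
        strand=f[6],
        frame=None if f[7] == "." else int(f[7]),
        attributes=parse_attributes(f[8]),
    )


# The example record from the slides, written with explicit tabs.
line = (
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\t'
    'gene_id "ENSG00000223972.5"; '
    'gene_type "transcribed_unprocessed_pseudogene"; '
    'gene_name "DDX11L1"; level 2; havana_gene "OTTHUMG00000000961.2";'
)
record = parse_gff_line(line)
print(record.start, record.end, record.attributes["gene_name"])
```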
 
 
Idea: divide the columns into different data streams
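The column-splitting idea can be sketched in a few lines of Python. This only shows the split itself, not how GPress encodes each stream afterwards; the function name `split_into_streams` is hypothetical, and tab-separated input is assumed.

```python
GFF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def split_into_streams(lines):
    """Collect each of the nine columns into its own list (one stream per column)."""
    streams = {name: [] for name in GFF_COLUMNS}
    for line in lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(GFF_COLUMNS):
            raise ValueError(f"expected 9 columns, got {len(fields)}")
        for name, value in zip(GFF_COLUMNS, fields):
            streams[name].append(value)
    return streams


# Two toy lines; only the first reuses values from the example record.
lines = [
    'chr1\tHAVANA\tgene\t11869\t14409\t.\t+\t.\tgene_id "ENSG00000223972.5";',
    'chr1\tHAVANA\texon\t11869\t12227\t.\t+\t.\tgene_id "ENSG00000223972.5";',
]
streams = split_into_streams(lines)
print(streams["feature"])
```

Grouping values this way puts similar data next to each other (for example, all start coordinates in one stream), which is what lets each stream be handled by a method suited to its type.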
 
Block Diagram for GPress Encoder
 
Results

Average 98% reduction in the size of the original file

But we still need to decompress the whole file for searches

Idea: compress in blocks
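A minimal sketch of block-wise compression follows, to show why it avoids decompressing the whole file. The slides do not state which compressor or block size GPress uses, so `zlib` and `block_size=2` are stand-ins, and the helper names are hypothetical.

```python
import zlib


def compress_in_blocks(lines, block_size=2):
    """Compress every `block_size` lines as an independent block."""
    blocks = []
    for i in range(0, len(lines), block_size):
        chunk = "\n".join(lines[i:i + block_size]).encode("utf-8")
        blocks.append(zlib.compress(chunk))
    return blocks


def read_line(blocks, line_number, block_size=2):
    """Fetch one line by decompressing only the block that contains it."""
    block_id, offset = divmod(line_number, block_size)
    chunk = zlib.decompress(blocks[block_id]).decode("utf-8")
    return chunk.split("\n")[offset]


# Five toy lines (made-up coordinates) give three blocks of sizes 2, 2, 1.
lines = [
    f"chr1\tHAVANA\texon\t{100 * i}\t{100 * i + 50}\t.\t+\t."
    for i in range(5)
]
blocks = compress_in_blocks(lines)
print(len(blocks), read_line(blocks, 3))
```

The trade-off is the usual one: smaller blocks mean less data to decompress per query, but each block is compressed with less context, so the overall ratio tends to drop.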
 
Random Access
 
Results

ID search
Given an ID (e.g. ENSE00003486434.1), GPress can print out the item's information in around 2.5 seconds.

Range search
Given a range of coordinates (e.g. 10000 to 100000), GPress can print out all items in that range in around 4.5 seconds.
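One way such ID and range queries could work over compressed blocks is with a small per-block index, sketched below in Python. This is an assumption for illustration, not the index GPress actually builds: the index layout, the `zlib` compressor, the helper names and the toy records are all made up, and "in range" is taken to mean fully contained in the query interval.

```python
import zlib


def build_blocks(records, block_size=2):
    """records: list of (start, end, item_id). Returns compressed blocks and an index."""
    blocks, index = [], []
    for i in range(0, len(records), block_size):
        chunk = records[i:i + block_size]
        text = "\n".join(f"{s}\t{e}\t{item_id}" for s, e, item_id in chunk)
        blocks.append(zlib.compress(text.encode("utf-8")))
        # Per-block summary kept uncompressed so queries can skip blocks.
        index.append({
            "min_start": min(s for s, _, _ in chunk),
            "max_end": max(e for _, e, _ in chunk),
            "ids": {item_id for _, _, item_id in chunk},
        })
    return blocks, index


def decode_block(block):
    """Decompress one block back into (start, end, item_id) tuples."""
    rows = []
    for row in zlib.decompress(block).decode("utf-8").split("\n"):
        s, e, item_id = row.split("\t")
        rows.append((int(s), int(e), item_id))
    return rows


def id_search(blocks, index, wanted):
    """Decompress only the block whose index lists the wanted ID."""
    for block, meta in zip(blocks, index):
        if wanted in meta["ids"]:
            for rec in decode_block(block):
                if rec[2] == wanted:
                    return rec
    return None


def range_search(blocks, index, lo, hi):
    """Return items fully inside [lo, hi], skipping blocks that cannot overlap."""
    hits = []
    for block, meta in zip(blocks, index):
        if meta["max_end"] < lo or meta["min_start"] > hi:
            continue
        hits.extend(r for r in decode_block(block) if r[0] >= lo and r[1] <= hi)
    return hits


# Toy records with made-up coordinates and IDs.
records = [
    (11000, 12000, "toy_gene_1"),
    (15000, 16000, "toy_exon_1"),
    (120000, 121000, "toy_gene_2"),
    (200000, 200500, "toy_gene_3"),
]
blocks, index = build_blocks(records)
print(id_search(blocks, index, "toy_gene_3"))
print(range_search(blocks, index, 10000, 100000))
```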
 
3. Extension to other files
 
Expression Files

ENST00000567887.5|ENSG00000004059.10|OTTHUMG00000023246.5|OTTHUMT00000059567.2|ARF5-001|ARF5|1103|protein_coding|    01c9c486-321f-4ebc-ade7-bbe6ea5c4a6e    5060.01331481811    142.006325562341    892.1653    1103

Column 1: target ID
Column 2: sample
Column 3: EST counts
Column 4: tpm value
Column 5: effective length
Column 6: length

Use Same Idea
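Applied to expression files, the same column-splitting idea might look like the Python sketch below. The stream names and the function `expression_streams` are hypothetical, and the sketch assumes the six columns are separated by whitespace, as in the example record.

```python
EXPRESSION_COLUMNS = [
    "target_id", "sample", "est_counts", "tpm", "eff_length", "length",
]


def expression_streams(lines):
    """Collect each of the six expression columns into its own stream."""
    streams = {name: [] for name in EXPRESSION_COLUMNS}
    for line in lines:
        if not line.strip():
            continue
        # Assumption: columns are whitespace-separated and contain no spaces.
        fields = line.split()
        if len(fields) != len(EXPRESSION_COLUMNS):
            raise ValueError(f"expected 6 columns, got {len(fields)}")
        for name, value in zip(EXPRESSION_COLUMNS, fields):
            streams[name].append(value)
    return streams


# The example record from the slides, joined back into one line.
record = (
    "ENST00000567887.5|ENSG00000004059.10|OTTHUMG00000023246.5|"
    "OTTHUMT00000059567.2|ARF5-001|ARF5|1103|protein_coding|"
    "\t01c9c486-321f-4ebc-ade7-bbe6ea5c4a6e"
    "\t5060.01331481811\t142.006325562341\t892.1653\t1103"
)
streams = expression_streams([record])
print(streams["tpm"], streams["length"])
```

The target ID is itself a "|"-separated list of identifiers, so it could be split further into sub-streams; whether GPress does so is not stated in the slides.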
 
 
Results
 
Reduces expression file size by more than 68% compared to gzip.
Retrieves the requested information within seconds.
 
4. Conclusion
 
Conclusion
 
Average 98% reduction in the size of the original GFF file
Compresses around twice as well as gzip
Supports queries in seconds
Can support expression files
 
Future Work:
Other annotation files with similar structure
 
WIG files
VCF files
BED files
…
 
Thanks!
 
Any questions?





