TY - JOUR
T1 - GTRAC: Fast retrieval from compressed collections of genomic variants
AU - Tatwawadi, Kedar
AU - Hernaez, Mikel
AU - Ochoa, Idoia
AU - Weissman, Tsachy
N1 - Funding Information:
This work was partially supported by a fellowship from the Basque Government, a Stanford Graduate Fellowships Program in Science and Engineering, the Stanford Data Science Initiative (SDSI), and an NIH grant with number 1 U01 CA198943-01.
Publisher Copyright:
© The Author 2016. Published by Oxford University Press. All rights reserved.
PY - 2016/9/1
Y1 - 2016/9/1
AB - Motivation: The dramatic decrease in the cost of sequencing has resulted in the generation of huge amounts of genomic data, as evidenced by projects such as the UK10K and the Million Veteran Project, with the number of sequenced genomes on the order of 10 K to 1 M. Due to the large redundancies among genomic sequences of individuals from the same species, most medical research deals with the variants in the sequences as compared with a reference sequence, rather than with the complete genomic sequences. Consequently, millions of genomes represented as variants are stored in databases. These databases are constantly updated and queried to extract information such as the common variants among individuals or groups of individuals. Previous algorithms for compressing this type of database lack efficient random access capabilities, rendering querying the database for particular variants and/or individuals extremely inefficient, to the point where compression is often relinquished altogether. Results: We present a new algorithm for this task, called GTRAC, that achieves significant compression ratios while allowing fast random access over the compressed database. For example, GTRAC is able to compress a Homo sapiens dataset containing 1092 samples to 1.1 GB (compression ratio of 160), while allowing for decompression of specific samples in less than a second and decompression of specific variants in 17 ms. GTRAC uses and adapts techniques from information theory, such as a specialized Lempel-Ziv compressor, and tailored succinct data structures. Availability and Implementation: The GTRAC algorithm is available for download at: https://github.com/kedartatwawadi/GTRAC
UR - http://www.scopus.com/inward/record.url?scp=84990944459&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84990944459&partnerID=8YFLogxK
U2 - 10.1093/bioinformatics/btw437
DO - 10.1093/bioinformatics/btw437
M3 - Article
C2 - 27587665
AN - SCOPUS:84990944459
SN - 1367-4803
VL - 32
SP - i479-i486
JO - Bioinformatics
JF - Bioinformatics
IS - 17
ER -