TY - GEN
T1 - MCUIUC - A new framework for metagenomic read compression
AU - Ligo, Jonathan G.
AU - Kim, Minji
AU - Emad, Amin
AU - Milenkovic, Olgica
AU - Veeravalli, Venugopal V.
PY - 2013
Y1 - 2013
N2 - Metagenomics is an emerging field of molecular biology concerned with analyzing the genomes of environmental samples comprising many diverse organisms. Given the nature of metagenomic data, one usually has to sequence the genomic material of all organisms in a batch, leading to a mix of reads coming from different DNA sequences. In deep high-throughput sequencing experiments, the volume of the raw reads is extremely high, frequently exceeding 600 Gb. With an ever-increasing demand for storing such reads for future studies, the issue of efficient metagenomic compression is of paramount importance. We present the first known approach to metagenome read compression, termed MCUIUC (Metagenomic Compression at UIUC). The gist of the proposed algorithm is to perform classification of reads based on unique organism identifiers, followed by reference-based alignment of reads for individually identified organisms, and metagenomic assembly of unclassified reads. Once assembly and classification are completed, lossless reference-based compression is performed via positional encoding. We evaluate the performance of the algorithm on moderate-sized synthetic metagenomic samples involving 15 randomly selected organisms and describe future directions for improving the proposed compression method.
AB - Metagenomics is an emerging field of molecular biology concerned with analyzing the genomes of environmental samples comprising many diverse organisms. Given the nature of metagenomic data, one usually has to sequence the genomic material of all organisms in a batch, leading to a mix of reads coming from different DNA sequences. In deep high-throughput sequencing experiments, the volume of the raw reads is extremely high, frequently exceeding 600 Gb. With an ever-increasing demand for storing such reads for future studies, the issue of efficient metagenomic compression is of paramount importance. We present the first known approach to metagenome read compression, termed MCUIUC (Metagenomic Compression at UIUC). The gist of the proposed algorithm is to perform classification of reads based on unique organism identifiers, followed by reference-based alignment of reads for individually identified organisms, and metagenomic assembly of unclassified reads. Once assembly and classification are completed, lossless reference-based compression is performed via positional encoding. We evaluate the performance of the algorithm on moderate-sized synthetic metagenomic samples involving 15 randomly selected organisms and describe future directions for improving the proposed compression method.
UR - http://www.scopus.com/inward/record.url?scp=84893286104&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84893286104&partnerID=8YFLogxK
U2 - 10.1109/ITW.2013.6691312
DO - 10.1109/ITW.2013.6691312
M3 - Conference contribution
AN - SCOPUS:84893286104
SN - 9781479913237
T3 - 2013 IEEE Information Theory Workshop, ITW 2013
BT - 2013 IEEE Information Theory Workshop, ITW 2013
T2 - 2013 IEEE Information Theory Workshop, ITW 2013
Y2 - 9 September 2013 through 13 September 2013
ER -