TY - JOUR
T1 - ChIPWig
T2 - A random access-enabling lossless and lossy compression method for ChIP-seq data
AU - Ravanmehr, Vida
AU - Kim, Minji
AU - Wang, Zhiying
AU - Milenkovic, Olgica
N1 - The work was supported in part by the National Institutes of Health (NIH) Big Data to Knowledge (BD2K) Targeted Software Development program, under the grant number 5U01CA198943-03, the National Science Foundation (NSF) Emerging Frontiers for Science of Information Center, and NSF Graduate Research Fellowship Program DGE-1144245.
PY - 2018/3/15
Y1 - 2018/3/15
N2 - Motivation: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are inexpensive and time-efficient, and result in massive datasets that introduce significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. ChIPWig enables random access and summary statistics lookups, and it is based on the asymptotic theory of optimal point density design for nonuniform quantizers. Results: We tested the ChIPWig compressor on 10 ChIP-seq datasets generated by the ENCODE consortium. On average, lossless ChIPWig reduced the file sizes to merely 6% of the original, and offered a 6-fold compression rate improvement compared to bigWig. The lossy feature further reduced file sizes 2-fold compared to the lossless mode, with little or no effect on peak calling and motif discovery using specialized NarrowPeaks methods. The compression and decompression speed rates are of the order of 0.2 sec/MB using general purpose computers. Availability and implementation: The source code and binaries, implemented in C++, are freely available for download at https://github.com/vidarmehr/ChIPWig-v2. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
AB - Motivation: Chromatin immunoprecipitation sequencing (ChIP-seq) experiments are inexpensive and time-efficient, and result in massive datasets that introduce significant storage and maintenance challenges. To address the resulting Big Data problems, we propose a lossless and lossy compression framework specifically designed for ChIP-seq Wig data, termed ChIPWig. ChIPWig enables random access and summary statistics lookups, and it is based on the asymptotic theory of optimal point density design for nonuniform quantizers. Results: We tested the ChIPWig compressor on 10 ChIP-seq datasets generated by the ENCODE consortium. On average, lossless ChIPWig reduced the file sizes to merely 6% of the original, and offered a 6-fold compression rate improvement compared to bigWig. The lossy feature further reduced file sizes 2-fold compared to the lossless mode, with little or no effect on peak calling and motif discovery using specialized NarrowPeaks methods. The compression and decompression speed rates are of the order of 0.2 sec/MB using general purpose computers. Availability and implementation: The source code and binaries, implemented in C++, are freely available for download at https://github.com/vidarmehr/ChIPWig-v2. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
UR - https://www.scopus.com/pages/publications/85044340867
UR - https://www.scopus.com/pages/publications/85044340867#tab=citedBy
U2 - 10.1093/bioinformatics/btx685
DO - 10.1093/bioinformatics/btx685
M3 - Article
C2 - 29087447
AN - SCOPUS:85044340867
SN - 1367-4803
VL - 34
SP - 911
EP - 919
JO - Bioinformatics
JF - Bioinformatics
IS - 6
ER -