GeneComp, a New Reference-Based Compressor for SAM Files

Reggy Long, Mikel Hernaez, Idoia Ochoa, Tsachy Weissman

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

The affordability of DNA sequencing has led to unprecedented volumes of genomic data. These data must be stored, processed, and analyzed. The most popular format for genomic data is the SAM format, which contains information such as alignments and quality values. These files are large (on the order of terabytes), which necessitates compression. In this work we propose a new reference-based compressor for SAM files that can accommodate different levels of compression, depending on the specific needs of the user. In particular, the proposed compressor, GeneComp, allows the user to perform lossy compression of the quality scores, which have been shown to occupy more than half of the compressed file when losslessly compressed. We show that GeneComp overall achieves better compression ratios than previously proposed algorithms when operating in lossless mode.

Original language: English (US)
Title of host publication: Proceedings - DCC 2017, 2017 Data Compression Conference
Editors: Ali Bilgin, Joan Serra-Sagrista, Michael W. Marcellin, James A. Storer
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 330-339
Number of pages: 10
ISBN (Electronic): 9781509067213
State: Published - May 8 2017
Event: 2017 Data Compression Conference, DCC 2017 - Snowbird, United States
Duration: Apr 4 2017 - Apr 7 2017

Publication series

Name: Data Compression Conference Proceedings
Volume: Part F127767
ISSN (Print): 1068-0314

Other

Other: 2017 Data Compression Conference, DCC 2017
Country/Territory: United States
City: Snowbird
Period: 4/4/17 - 4/7/17

Keywords

  • compression
  • genomic data
  • SAM file

ASJC Scopus subject areas

  • Computer Networks and Communications
