AliCo: A New Efficient Representation for SAM Files

Idoia Ochoa, Hongyi Li, Florian Baumgarte, Charles Hergenrother, Jan Voges, Mikel Hernaez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As genome sequencing becomes more affordable, more raw and aligned genomic files are expected to be generated in the coming years. In addition, as the throughput of sequencing machines increases, these files are growing significantly in size. Aligned files (e.g., SAM/BAM) in particular are used for further processing of the data, so representing them efficiently is a pressing need. In this work we present AliCo, a new compression method tailored to aligned data in the SAM format. We demonstrate through simulations on existing datasets that AliCo, on average, outperforms state-of-the-art SAM compressors in compression ratio, achieving more than 85% reduction in size in its lossless mode. AliCo also supports a variety of modes for lossy compression of the quality scores, including, for the first time, the recently proposed lossy compressor CALQ, which uses information from the aligned reads to adjust the level of quantization at each genome location (achieving more than 10× compression gains on high-coverage datasets). AliCo also supports optional compression of the reference sequence used for compression, thereby guaranteeing exact reconstruction of the compressed data. Finally, AliCo can stream the data as it is being compressed and decompress the data as it is being received, potentially providing significant time savings.

Original language: English (US)
Title of host publication: Proceedings - DCC 2019
Subtitle of host publication: 2019 Data Compression Conference
Editors: James A. Storer, Ali Bilgin, Joan Serra-Sagrista, Michael W. Marcellin
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 93-102
Number of pages: 10
ISBN (Electronic): 9781728106571
State: Published - May 10 2019
Event: 2019 Data Compression Conference, DCC 2019 - Snowbird, United States
Duration: Mar 26 2019 to Mar 29 2019

Publication series

Name: Data Compression Conference Proceedings
Volume: 2019-March
ISSN (Print): 1068-0314

Conference

Conference: 2019 Data Compression Conference, DCC 2019
Country/Territory: United States
City: Snowbird
Period: 3/26/19 to 3/29/19

Keywords

  • Aligned data
  • Compression
  • Genomic data
  • SAM file

ASJC Scopus subject areas

  • Computer Networks and Communications
