AliCo

A New Efficient Representation for SAM Files

Idoia Ochoa-Alvarez, Hongyi Li, Florian Baumgarte, Charles Hergenrother, Jan Voges, Mikel Hernaez

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As genome sequencing continues to become more cost-effective and affordable, more raw and aligned genomic files are expected to be generated in future years. In addition, due to the increase in the throughput of sequencing machines, the size of these files is growing significantly. In particular, aligned files (e.g., SAM/BAM) are used for further processing of the data, and hence efficient representation of these files is a pressing need. In this work we present AliCo, a new compression method tailored to aligned data represented in the SAM format. We demonstrate through simulations on existing datasets that AliCo, on average, outperforms the state-of-the-art compressors for SAM files in compression ratio, achieving more than 85% reduction in size when operating in its lossless mode. AliCo also supports a variety of modes for lossy compression of the quality scores, including, for the first time, the recently proposed lossy compressor CALQ, which uses information from the aligned reads to adjust the level of quantization for each location of the genome (achieving more than 10× compression gains in high-coverage datasets). AliCo also supports optional compression of the reference sequence used for compression, hence guaranteeing exact reconstruction of the compressed data. Finally, AliCo allows the data to be streamed as it is being compressed and decompressed as it is being received, potentially providing significant time savings.
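The two headline figures in the abstract are stated in different units: a percentage size reduction and a multiplicative gain. As a rough illustration only, the sketch below converts between the two, assuming that a gain of k× means the compressed size is 1/k of the original (the abstract does not spell out the baseline for the 10× figure). The function names are hypothetical and are not part of AliCo.

```python
def reduction_to_ratio(reduction: float) -> float:
    """Convert a fractional size reduction (e.g. 0.85) to a compression ratio."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return 1.0 / (1.0 - reduction)


def ratio_to_reduction(ratio: float) -> float:
    """Convert a compression ratio (original / compressed) to a fractional reduction."""
    if ratio < 1.0:
        raise ValueError("ratio must be at least 1")
    return 1.0 - 1.0 / ratio


# Assumption: figures are read as original size relative to compressed size.
print(f"{reduction_to_ratio(0.85):.2f}x")  # 6.67x
print(f"{ratio_to_reduction(10.0):.0%}")   # 90%
```

Under this reading, a reduction of more than 85% corresponds to a ratio above roughly 6.67×, and a 10× gain corresponds to a 90% reduction.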

Original language: English (US)
Title of host publication: Proceedings - DCC 2019
Subtitle of host publication: 2019 Data Compression Conference
Editors: James A. Storer, Ali Bilgin, Joan Serra-Sagrista, Michael W. Marcellin
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 93-102
Number of pages: 10
ISBN (Electronic): 9781728106571
DOI: https://doi.org/10.1109/DCC.2019.00017
State: Published - May 10 2019
Event: 2019 Data Compression Conference, DCC 2019 - Snowbird, United States
Duration: Mar 26 2019 to Mar 29 2019

Publication series

Name: Data Compression Conference Proceedings
Volume: 2019-March
ISSN (Print): 1068-0314

Conference

Conference: 2019 Data Compression Conference, DCC 2019
Country: United States
City: Snowbird
Period: 3/26/19 to 3/29/19

Fingerprint

Compressors
Genes
Information use
Throughput
Processing
Costs

Keywords

  • Aligned data
  • Compression
  • Genomic data
  • SAM file

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Ochoa-Alvarez, I., Li, H., Baumgarte, F., Hergenrother, C., Voges, J., & Hernaez, M. (2019). AliCo: A New Efficient Representation for SAM Files. In J. A. Storer, A. Bilgin, J. Serra-Sagrista, & M. W. Marcellin (Eds.), Proceedings - DCC 2019: 2019 Data Compression Conference (pp. 93-102). [8712770] (Data Compression Conference Proceedings; Vol. 2019-March). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/DCC.2019.00017

@inproceedings{79d9fed773fc42d9bff95655ed2ae8da,
title = "AliCo: A New Efficient Representation for SAM Files",
keywords = "Aligned data, Compression, Genomic data, SAM file",
author = "Idoia Ochoa-Alvarez and Hongyi Li and Florian Baumgarte and Charles Hergenrother and Jan Voges and Mikel Hernaez",
year = "2019",
month = "5",
day = "10",
doi = "10.1109/DCC.2019.00017",
language = "English (US)",
series = "Data Compression Conference Proceedings",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "93--102",
editor = "Storer, {James A.} and Ali Bilgin and Joan Serra-Sagrista and Marcellin, {Michael W.}",
booktitle = "Proceedings - DCC 2019",
address = "United States",

}

TY - GEN

T1 - AliCo

T2 - A New Efficient Representation for SAM Files

AU - Ochoa-Alvarez, Idoia

AU - Li, Hongyi

AU - Baumgarte, Florian

AU - Hergenrother, Charles

AU - Voges, Jan

AU - Hernaez, Mikel

PY - 2019/5/10

Y1 - 2019/5/10

KW - Aligned data

KW - Compression

KW - Genomic data

KW - SAM file

UR - http://www.scopus.com/inward/record.url?scp=85066315861&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85066315861&partnerID=8YFLogxK

U2 - 10.1109/DCC.2019.00017

DO - 10.1109/DCC.2019.00017

M3 - Conference contribution

T3 - Data Compression Conference Proceedings

SP - 93

EP - 102

BT - Proceedings - DCC 2019

A2 - Storer, James A.

A2 - Bilgin, Ali

A2 - Serra-Sagrista, Joan

A2 - Marcellin, Michael W.

PB - Institute of Electrical and Electronics Engineers Inc.

ER -