A Cluster-Based Approach to Compression of Quality Scores

Mikel Hernaez, Idoia Ochoa, Tsachy Weissman

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Massive amounts of sequencing data are being generated thanks to advances in sequencing technology and a dramatic drop in sequencing cost. Storing and sharing these large volumes of data has become a major bottleneck in the discovery and analysis of genetic variants that are used for medical inference. As such, lossless compression of this data has been proposed. More than 70% of the compressed data corresponds to quality scores, which indicate how reliable the sequencing machine is when calling a particular base pair. Thus, to further improve compression performance, lossy compression of quality scores is emerging as the natural candidate. Since the data is used for genetic variant discovery, lossy compressors for quality scores are analyzed in terms of their rate-distortion performance, as well as their effect on variant callers. Previously proposed algorithms do not do well under all performance metrics, and are hence unsuitable for certain applications. In this work we propose a new lossy compressor that first performs a clustering step, assuming that all quality score sequences come from a mixture of Markov models. It then quantizes the quality scores based on the Markov models. Each quantizer targets a specific distortion to optimize the overall rate-distortion performance. Finally, the quantized values are compressed by an entropy encoder. We demonstrate that the proposed lossy compressor outperforms the previously proposed methods under all analyzed distortion metrics. This suggests that the effect of the proposed algorithm on any downstream application will likely be less noticeable than that of previously proposed lossy compressors. Moreover, we analyze how the proposed lossy compressor affects Single Nucleotide Polymorphism (SNP) calling, and show that the variability introduced in the calls is considerably smaller than the variability that exists between different methodologies for SNP calling.
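The pipeline outlined in the abstract (cluster, quantize, entropy-code) can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes first-order Markov models, a hard-assignment variant of EM for the clustering step, and a plain uniform scalar quantizer standing in for the model-based quantizers the abstract mentions. All function names and parameters (`fit_markov`, `cluster_markov`, `quantize`, `empirical_entropy`, `pseudo`, `step`) are hypothetical, and the empirical entropy is only a rough proxy for the rate an entropy encoder would achieve.

```python
import numpy as np

def fit_markov(seqs, alphabet_size, pseudo=1.0):
    """Estimate a first-order Markov model from integer sequences.

    Returns (initial distribution, row-stochastic transition matrix).
    `pseudo` is an additive smoothing count (an assumption of this sketch).
    """
    init = np.full(alphabet_size, pseudo, dtype=float)
    trans = np.full((alphabet_size, alphabet_size), pseudo, dtype=float)
    for s in seqs:
        init[s[0]] += 1
        # Count every consecutive pair (s[t], s[t+1]).
        np.add.at(trans, (s[:-1], s[1:]), 1)
    return init / init.sum(), trans / trans.sum(axis=1, keepdims=True)

def log_likelihood(s, model):
    """Log-probability of sequence s under a (init, trans) Markov model."""
    init, trans = model
    return np.log(init[s[0]]) + np.log(trans[s[:-1], s[1:]]).sum()

def cluster_markov(seqs, k, alphabet_size, iters=20, seed=0):
    """Hard-assignment EM: refit one Markov model per cluster, then
    reassign each sequence to the model that gives it the highest likelihood."""
    rng = np.random.default_rng(seed)
    labels = rng.integers(k, size=len(seqs))  # random initial assignment
    for _ in range(iters):
        models = [
            fit_markov([s for s, l in zip(seqs, labels) if l == c], alphabet_size)
            for c in range(k)
        ]
        new_labels = np.array(
            [int(np.argmax([log_likelihood(s, m) for m in models])) for s in seqs]
        )
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels, models

def quantize(s, step):
    """Uniform scalar quantizer (a stand-in, not the model-based quantizer):
    each score is replaced by the lower edge of its bin plus step // 2."""
    return (s // step) * step + step // 2

def empirical_entropy(values):
    """Zeroth-order empirical entropy in bits per symbol (a rate proxy)."""
    _, counts = np.unique(values, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())
```

In such a sketch, a coarser `step` for one cluster lowers the entropy of its quantized scores at the price of a larger mean squared error, which is the kind of rate-distortion trade-off the abstract refers to.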

Original language: English (US)
Title of host publication: Proceedings - DCC 2016
Subtitle of host publication: 2016 Data Compression Conference
Editors: Michael W. Marcellin, Ali Bilgin, Joan Serra-Sagrista, James A. Storer
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 261-270
Number of pages: 10
ISBN (Electronic): 9781509018536
State: Published - Dec 15 2016
Event: 2016 Data Compression Conference, DCC 2016 - Snowbird, United States
Duration: Mar 29 2016 to Apr 1 2016

Publication series

Name: Data Compression Conference Proceedings
ISSN (Print): 1068-0314

Other

Other: 2016 Data Compression Conference, DCC 2016
Country/Territory: United States
City: Snowbird
Period: 3/29/16 to 4/1/16

Keywords

  • Clustering
  • Genomic data
  • Lossy Compression
  • Markov Mixture Models
  • Quality Scores
  • Variant Calling

ASJC Scopus subject areas

  • Computer Networks and Communications
