Effect of lossy compression of quality scores on variant calling

Idoia Ochoa, Mikel Hernaez, Rachel Goldfeder, Tsachy Weissman, Euan Ashley

Research output: Contribution to journal › Article › peer-review


Recent advancements in sequencing technology have led to a drastic reduction in genome sequencing costs. This development has generated an unprecedented amount of data that must be stored, processed, and communicated. To facilitate this effort, compression of genomic files has been proposed. Specifically, lossy compression of quality scores is emerging as a natural candidate for reducing the growing costs of storage. A main goal of performing DNA sequencing in population studies and clinical settings is to identify genetic variation. Though the field agrees that smaller files are advantageous, the cost of lossy compression, in terms of variant discovery, is unclear. Bioinformatic algorithms that identify SNPs and INDELs use base quality score information; here, we evaluate the effect of lossy compression of quality scores on SNP and INDEL detection. Specifically, we investigate how the output of the variant caller when using the original data differs from that obtained when quality scores are replaced by those generated by a lossy compressor. Using gold standard genomic datasets and simulated data, we are able to analyze how accurate the output of the variant caller is, both for the original data and for data that have been lossily compressed. We show that lossy compression can significantly alleviate the storage burden while maintaining variant calling performance comparable to that with the original data. Further, in some cases lossy compression can lead to variant calling performance that is superior to that using the original file. We envisage our findings and framework serving as a benchmark in future development and analyses of lossy genomic data compressors.
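The comparison described in the abstract, scoring variant calls from the original data and from lossily compressed data against a gold standard, can be illustrated with a minimal sketch. The abstract does not state which metrics or tools the authors used, so the choice of sensitivity, precision, and F-score here is an assumption, as are the names `Variant`, `CallMetrics`, and `score_calls`. The variant sets are toy data invented for illustration and are not results from the paper.

```python
from typing import NamedTuple, Set, Tuple

# A variant is keyed by (chromosome, 1-based position, ref allele, alt allele).
# This exact-match key is a simplifying assumption; real comparisons often
# normalize INDEL representations before matching.
Variant = Tuple[str, int, str, str]


class CallMetrics(NamedTuple):
    true_positives: int
    false_positives: int
    false_negatives: int
    sensitivity: float
    precision: float
    f_score: float


def score_calls(calls: Set[Variant], gold: Set[Variant]) -> CallMetrics:
    """Score a set of variant calls against a gold standard set."""
    tp = len(calls & gold)   # called and present in the gold standard
    fp = len(calls - gold)   # called but absent from the gold standard
    fn = len(gold - calls)   # present in the gold standard but not called
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = sensitivity + precision
    f_score = 2 * sensitivity * precision / denom if denom else 0.0
    return CallMetrics(tp, fp, fn, sensitivity, precision, f_score)


# Toy data, invented for illustration only (hypothetical variants).
gold = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr2", 90, "T", "C"),
}
calls_original = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 40, "G", "GA"),
    ("chr3", 7, "A", "T"),
}
calls_lossy = {
    ("chr1", 100, "A", "G"),
    ("chr1", 250, "C", "T"),
    ("chr2", 90, "T", "C"),
}

original_metrics = score_calls(calls_original, gold)
lossy_metrics = score_calls(calls_lossy, gold)
```

Scoring both call sets against the same gold standard, rather than against each other, is what allows the lossy run to come out ahead of the original run on some metric, which is the kind of outcome the abstract reports for some cases.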

Original language: English (US)
Pages (from-to): 183-194
Number of pages: 12
Journal: Briefings in Bioinformatics
Issue number: 2
State: Published - 2017


Keywords

  • Genomic data
  • Lossy compression
  • Quality scores
  • Variant calling

ASJC Scopus subject areas

  • Information Systems
  • Molecular Biology

