Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments

Tavor Z. Baharav, Govinda M. Kamath, David N. Tse, Ilan Shomorony

Research output: Contribution to journalArticlepeer-review

Abstract

Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, k-mer Jaccard similarities are often used as a proxy for alignment size to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases or repeats), Jaccard similarity is no longer a good proxy for alignment size. We introduce a min-hash-based approach to estimate alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We show that this metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. To speed up pairwise sequence alignment, pairwise k-mer Jaccard similarities are often used as a proxy for alignment size. However, Jaccard similarity ceases to be a good proxy for alignment size when the k-mer distribution of the dataset is significantly non-uniform (e.g., due to GC biases and repeats). We introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which accounts for uneven k-mer distributions leading to significantly better performance.

Original languageEnglish (US)
Article number100081
JournalPatterns
Volume1
Issue number6
DOIs
StatePublished - Sep 11 2020

Keywords

  • DNA sequencing
  • DSML 1: Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem
  • Jaccard similarity
  • genome assembly
  • locality-sensitive hashing
  • min-hash
  • sequence alignment
  • singular value decomposition
  • spectral methods

ASJC Scopus subject areas

  • General Decision Sciences

Fingerprint

Dive into the research topics of 'Spectral Jaccard Similarity: A New Approach to Estimating Pairwise Sequence Alignments'. Together they form a unique fingerprint.

Cite this