TY - JOUR
T1 - Spectral Jaccard Similarity
T2 - A New Approach to Estimating Pairwise Sequence Alignments
AU - Baharav, Tavor Z.
AU - Kamath, Govinda M.
AU - Tse, David N.
AU - Shomorony, Ilan
N1 - The authors gratefully acknowledge funding from the NSF GRFP; Alcatel-Lucent Stanford Graduate Fellowship; NSF grant under CCF- 1563098 ; and the Center for Social Inclusion , an NSF Science and Technology Center under grant agreement CCF- 0939370 .
PY - 2020/9/11
Y1 - 2020/9/11
N2 - Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, k-mer Jaccard similarities are often used as a proxy for alignment size to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases or repeats), Jaccard similarity is no longer a good proxy for alignment size. We introduce a min-hash-based approach to estimate alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We show that this metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. To speed up pairwise sequence alignment, pairwise k-mer Jaccard similarities are often used as a proxy for alignment size. However, Jaccard similarity ceases to be a good proxy for alignment size when the k-mer distribution of the dataset is significantly non-uniform (e.g., due to GC biases and repeats). We introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which accounts for uneven k-mer distributions leading to significantly better performance.
AB - Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, the pairwise k-mer Jaccard similarity is sometimes used as a proxy for alignment size in order to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases and repeats), Jaccard similarity is no longer a good proxy for alignment size. In this work, we introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We empirically show that this new metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. Pairwise sequence alignment is often a computational bottleneck in genomic analysis pipelines, particularly in the context of third-generation sequencing technologies. To speed up this process, k-mer Jaccard similarities are often used as a proxy for alignment size to filter pairs of reads, and min-hashes are employed to efficiently estimate these similarities. However, when the k-mer distribution of a dataset is significantly non-uniform (e.g., due to GC biases or repeats), Jaccard similarity is no longer a good proxy for alignment size. We introduce a min-hash-based approach to estimate alignment sizes called Spectral Jaccard Similarity, which naturally accounts for uneven k-mer distributions. The Spectral Jaccard Similarity is computed by performing a singular value decomposition on a min-hash collision matrix. We show that this metric provides significantly better estimates for alignment sizes, and we provide a computationally efficient estimator for these spectral similarity scores. To speed up pairwise sequence alignment, pairwise k-mer Jaccard similarities are often used as a proxy for alignment size. However, Jaccard similarity ceases to be a good proxy for alignment size when the k-mer distribution of the dataset is significantly non-uniform (e.g., due to GC biases and repeats). We introduce a min-hash-based approach for estimating alignment sizes called Spectral Jaccard Similarity, which accounts for uneven k-mer distributions leading to significantly better performance.
KW - DNA sequencing
KW - DSML 1: Proof-of-Concept: Data science output has been formulated, implemented, and tested for one domain/problem
KW - Jaccard similarity
KW - genome assembly
KW - locality-sensitive hashing
KW - min-hash
KW - sequence alignment
KW - singular value decomposition
KW - spectral methods
UR - https://www.scopus.com/pages/publications/85102967407
UR - https://www.scopus.com/inward/citedby.url?scp=85102967407&partnerID=8YFLogxK
U2 - 10.1016/j.patter.2020.100081
DO - 10.1016/j.patter.2020.100081
M3 - Article
C2 - 33205128
AN - SCOPUS:85102967407
SN - 2666-3899
VL - 1
JO - Patterns
JF - Patterns
IS - 6
M1 - 100081
ER -