Optimal lower bounds for locality-sensitive hashing (except when q is tiny)

Ryan O'Donnell, Yi Wu, Yuan Zhou

Research output: Contribution to journalArticle

Abstract

We study lower bounds for Locality-Sensitive Hashing (LSH) in the strongest setting: point sets in {0,1}d under the Hamming distance. Recall that H is said to be an (r,cr,p, q)-sensitive hash family if all pairs x,y ∈ {0,1}d with dist(x,y) ≤ r have probability at least p of collision under a randomly chosen h ∈ H, whereas all pairs x,y ∈ {0,1}d with dist(x,y) ≥ cr have probability at most q of collision. Typically, one considers d → ∞, with c > 1 fixed and q bounded away from 0. For its applications to approximate nearest-neighbor search in high dimensions, the quality of an LSH family H is governed by how small its ρ parameter ρ = ln(1/p)/ln(1/q) is as a function of the parameter c. The seminal paper of Indyk and Motwani [1998] showed that for each c ≥ 1, the extremely simple family H = {x → xi: i ∈ [d]} achieves ρ ≤ 1/c. The only known lower bound, due to Motwani et al. [2007], is that ρ must be at least (e1/c - 1)/(e1/c + 1) ≥.46/c (minus od(1)). The contribution of this article is twofold. (1) We show the "optimal" lower bound for ρ: it must be at least 1/c (minus od(1)). Our proof is very simple, following almost immediately from the observation that the noise stability of a boolean function at time t is a log-convex function of t. (2) We raise and discuss the following issue: neither the application of LSH to nearest-neighbor search nor the known LSH lower bounds hold as stated if the q parameter is tiny. Here, "tiny" means q = 2-Θ(d), a parameter range we believe is natural.

Original languageEnglish (US)
Article number5
JournalACM Transactions on Computation Theory
Volume6
Issue number1
DOIs
StatePublished - Mar 2014
Externally publishedYes

Fingerprint

Hashing
Locality
Lower bound
Nearest Neighbor Search
Hamming distance
Boolean functions
Collision
Hamming Distance
Boolean Functions
Point Sets
Small Parameter
Higher Dimensions
Convex function
Immediately
Nearest neighbor search
Range of data
Family

Keywords

  • Fourier analysis of boolean functions
  • Locality-Sensitive Hashing
  • LSH
  • Noise sensitivity
  • Noise stability

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computational Theory and Mathematics

Cite this

Optimal lower bounds for locality-sensitive hashing (except when q is tiny). / O'Donnell, Ryan; Wu, Yi; Zhou, Yuan.

In: ACM Transactions on Computation Theory, Vol. 6, No. 1, 5, 03.2014.

Research output: Contribution to journalArticle

@article{bdec73e6102d445885ccf6f7291d93f6,
title = "Optimal lower bounds for locality-sensitive hashing (except when q is tiny)",
abstract = "We study lower bounds for Locality-Sensitive Hashing (LSH) in the strongest setting: point sets in {0,1}d under the Hamming distance. Recall that H is said to be an (r,cr,p, q)-sensitive hash family if all pairs x,y ∈ {0,1}d with dist(x,y) ≤ r have probability at least p of collision under a randomly chosen h ∈ H, whereas all pairs x,y ∈ {0,1}d with dist(x,y) ≥ cr have probability at most q of collision. Typically, one considers d → ∞, with c > 1 fixed and q bounded away from 0. For its applications to approximate nearest-neighbor search in high dimensions, the quality of an LSH family H is governed by how small its ρ parameter ρ = ln(1/p)/ln(1/q) is as a function of the parameter c. The seminal paper of Indyk and Motwani [1998] showed that for each c ≥ 1, the extremely simple family H = {x → xi: i ∈ [d]} achieves ρ ≤ 1/c. The only known lower bound, due to Motwani et al. [2007], is that ρ must be at least (e1/c - 1)/(e1/c + 1) ≥.46/c (minus od(1)). The contribution of this article is twofold. (1) We show the {"}optimal{"} lower bound for ρ: it must be at least 1/c (minus od(1)). Our proof is very simple, following almost immediately from the observation that the noise stability of a boolean function at time t is a log-convex function of t. (2) We raise and discuss the following issue: neither the application of LSH to nearest-neighbor search nor the known LSH lower bounds hold as stated if the q parameter is tiny. Here, {"}tiny{"} means q = 2-Θ(d), a parameter range we believe is natural.",
keywords = "Fourier analysis of boolean functions, Locality-Sensitive Hashing, LSH, Noise sensitivity, Noise stability",
author = "Ryan O'Donnell and Yi Wu and Yuan Zhou",
year = "2014",
month = "3",
doi = "10.1145/2578221",
language = "English (US)",
volume = "6",
journal = "ACM Transactions on Computation Theory",
issn = "1942-3454",
publisher = "Association for Computing Machinery (ACM)",
number = "1",

}

TY - JOUR

T1 - Optimal lower bounds for locality-sensitive hashing (except when q is tiny)

AU - O'Donnell, Ryan

AU - Wu, Yi

AU - Zhou, Yuan

PY - 2014/3

Y1 - 2014/3

N2 - We study lower bounds for Locality-Sensitive Hashing (LSH) in the strongest setting: point sets in {0,1}d under the Hamming distance. Recall that H is said to be an (r,cr,p, q)-sensitive hash family if all pairs x,y ∈ {0,1}d with dist(x,y) ≤ r have probability at least p of collision under a randomly chosen h ∈ H, whereas all pairs x,y ∈ {0,1}d with dist(x,y) ≥ cr have probability at most q of collision. Typically, one considers d → ∞, with c > 1 fixed and q bounded away from 0. For its applications to approximate nearest-neighbor search in high dimensions, the quality of an LSH family H is governed by how small its ρ parameter ρ = ln(1/p)/ln(1/q) is as a function of the parameter c. The seminal paper of Indyk and Motwani [1998] showed that for each c ≥ 1, the extremely simple family H = {x → xi: i ∈ [d]} achieves ρ ≤ 1/c. The only known lower bound, due to Motwani et al. [2007], is that ρ must be at least (e1/c - 1)/(e1/c + 1) ≥.46/c (minus od(1)). The contribution of this article is twofold. (1) We show the "optimal" lower bound for ρ: it must be at least 1/c (minus od(1)). Our proof is very simple, following almost immediately from the observation that the noise stability of a boolean function at time t is a log-convex function of t. (2) We raise and discuss the following issue: neither the application of LSH to nearest-neighbor search nor the known LSH lower bounds hold as stated if the q parameter is tiny. Here, "tiny" means q = 2-Θ(d), a parameter range we believe is natural.

AB - We study lower bounds for Locality-Sensitive Hashing (LSH) in the strongest setting: point sets in {0,1}d under the Hamming distance. Recall that H is said to be an (r,cr,p, q)-sensitive hash family if all pairs x,y ∈ {0,1}d with dist(x,y) ≤ r have probability at least p of collision under a randomly chosen h ∈ H, whereas all pairs x,y ∈ {0,1}d with dist(x,y) ≥ cr have probability at most q of collision. Typically, one considers d → ∞, with c > 1 fixed and q bounded away from 0. For its applications to approximate nearest-neighbor search in high dimensions, the quality of an LSH family H is governed by how small its ρ parameter ρ = ln(1/p)/ln(1/q) is as a function of the parameter c. The seminal paper of Indyk and Motwani [1998] showed that for each c ≥ 1, the extremely simple family H = {x → xi: i ∈ [d]} achieves ρ ≤ 1/c. The only known lower bound, due to Motwani et al. [2007], is that ρ must be at least (e1/c - 1)/(e1/c + 1) ≥.46/c (minus od(1)). The contribution of this article is twofold. (1) We show the "optimal" lower bound for ρ: it must be at least 1/c (minus od(1)). Our proof is very simple, following almost immediately from the observation that the noise stability of a boolean function at time t is a log-convex function of t. (2) We raise and discuss the following issue: neither the application of LSH to nearest-neighbor search nor the known LSH lower bounds hold as stated if the q parameter is tiny. Here, "tiny" means q = 2-Θ(d), a parameter range we believe is natural.

KW - Fourier analysis of boolean functions

KW - Locality-Sensitive Hashing

KW - LSH

KW - Noise sensitivity

KW - Noise stability

UR - http://www.scopus.com/inward/record.url?scp=84897475487&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84897475487&partnerID=8YFLogxK

U2 - 10.1145/2578221

DO - 10.1145/2578221

M3 - Article

AN - SCOPUS:84897475487

VL - 6

JO - ACM Transactions on Computation Theory

JF - ACM Transactions on Computation Theory

SN - 1942-3454

IS - 1

M1 - 5

ER -