Compression schemes for similarity queries

Idoia Ochoa, Amir Ingber, Tsachy Weissman

Research output: Chapter in Book/Report/Conference proceedingConference contribution


We consider compression of sequences in a database so that similarity queries can be performed efficiently in the compressed domain. The fundamental limits for this problem setting, which characterize the trade off between compression rate and reliability of the answers to the queries, have been characterized in past work. However, how to approach these limits in practice has remained largely unexplored. Recently, we proposed a scheme for this task that is based on existing lossy compression algorithms, for the general case where the similarity measure satisfies a triangle inequality. Although it was shown that it achieves the fundamental limits for some cases, it is suboptimal in general. In this paper we propose a new scheme that also uses lossy compression algorithms as a building block, but with a carefully chosen distortion measure that is different than the one defining the similarity between sequences. The new scheme significantly improves the compression rate compared to the previously proposed scheme in many cases. For example, for binary sources and Hamming similarity measure, simulation results show a compression rate close to the fundamental limit, and an improvement over the previously proposed scheme of up to 55% (for the same reliability). The results shed light on the fact that compression for similarity identification is inherently different than classical lossy compression.

Original languageEnglish (US)
Title of host publicationProceedings - DCC 2014
Subtitle of host publication2014 Data Compression Conference
PublisherInstitute of Electrical and Electronics Engineers Inc.
Number of pages10
ISBN (Print)9781479938827
StatePublished - Jan 1 2014
Externally publishedYes
Event2014 Data Compression Conference, DCC 2014 - Snowbird, UT, United States
Duration: Mar 26 2014Mar 28 2014

Publication series

NameData Compression Conference Proceedings
ISSN (Print)1068-0314


Other2014 Data Compression Conference, DCC 2014
Country/TerritoryUnited States
CitySnowbird, UT

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this