Efficient similarity queries via lossy compression

Idoia Ochoa, Amir Ingber, Tsachy Weissman

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

The generation of new databases and the amount of data on existing ones is growing exponentially. As a result, executing queries on large databases is becoming a timely and challenging task. With this in mind, we study the problem of compressing sequences in a database so that similarity queries can be performed efficiently on the compressed database. The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied and characterized in past work, how to approach these limits in practice has remained largely unexplored. In this paper, we propose an approach to this task, based in part on existing lossy compression algorithms. Specifically, we consider queries of the form: 'which sequences in the database are similar to a given sequence y?'. For the case where similarity between sequences is measured via Hamming distortion, we construct schemes whose performance is close to the fundamental limits. Furthermore, we test our scheme on a sample database of real DNA sequences, and show significant compression while still allowing highly reliable query answers.

Original languageEnglish (US)
Title of host publication2013 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013
PublisherIEEE Computer Society
Pages883-889
Number of pages7
ISBN (Print)9781479934096
DOIs
StatePublished - 2013
Externally publishedYes
Event51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013 - Monticello, IL, United States
Duration: Oct 2 2013Oct 4 2013

Publication series

Name2013 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013

Other

Other51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013
Country/TerritoryUnited States
CityMonticello, IL
Period10/2/1310/4/13

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Control and Systems Engineering

Fingerprint

Dive into the research topics of 'Efficient similarity queries via lossy compression'. Together they form a unique fingerprint.

Cite this