TY - GEN
T1 - Efficient similarity queries via lossy compression
AU - Ochoa, Idoia
AU - Ingber, Amir
AU - Weissman, Tsachy
PY - 2013
Y1 - 2013
N2 - The generation of new databases and the amount of data on existing ones is growing exponentially. As a result, executing queries on large databases is becoming a timely and challenging task. With this in mind, we study the problem of compressing sequences in a database so that similarity queries can be performed efficiently on the compressed database. The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied and characterized in past work, how to approach these limits in practice has remained largely unexplored. In this paper, we propose an approach to this task, based in part on existing lossy compression algorithms. Specifically, we consider queries of the form: 'which sequences in the database are similar to a given sequence y?'. For the case where similarity between sequences is measured via Hamming distortion, we construct schemes whose performance is close to the fundamental limits. Furthermore, we test our scheme on a sample database of real DNA sequences, and show significant compression while still allowing highly reliable query answers.
AB - The generation of new databases and the amount of data on existing ones is growing exponentially. As a result, executing queries on large databases is becoming a timely and challenging task. With this in mind, we study the problem of compressing sequences in a database so that similarity queries can be performed efficiently on the compressed database. The fundamental limits of this problem characterize the tradeoff between compression rate and the reliability of the queries performed on the compressed data. While those asymptotic limits have been studied and characterized in past work, how to approach these limits in practice has remained largely unexplored. In this paper, we propose an approach to this task, based in part on existing lossy compression algorithms. Specifically, we consider queries of the form: 'which sequences in the database are similar to a given sequence y?'. For the case where similarity between sequences is measured via Hamming distortion, we construct schemes whose performance is close to the fundamental limits. Furthermore, we test our scheme on a sample database of real DNA sequences, and show significant compression while still allowing highly reliable query answers.
UR - http://www.scopus.com/inward/record.url?scp=84897732161&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84897732161&partnerID=8YFLogxK
U2 - 10.1109/Allerton.2013.6736618
DO - 10.1109/Allerton.2013.6736618
M3 - Conference contribution
AN - SCOPUS:84897732161
SN - 9781479934096
T3 - 2013 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013
SP - 883
EP - 889
BT - 2013 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013
PB - IEEE Computer Society
T2 - 51st Annual Allerton Conference on Communication, Control, and Computing, Allerton 2013
Y2 - 2 October 2013 through 4 October 2013
ER -