TY - GEN
T1 - Compression schemes for similarity queries
AU - Ochoa, Idoia
AU - Ingber, Amir
AU - Weissman, Tsachy
PY - 2014/1/1
Y1 - 2014/1/1
N2 - We consider compression of sequences in a database so that similarity queries can be performed efficiently in the compressed domain. The fundamental limits for this problem setting, which characterize the trade off between compression rate and reliability of the answers to the queries, have been characterized in past work. However, how to approach these limits in practice has remained largely unexplored. Recently, we proposed a scheme for this task that is based on existing lossy compression algorithms, for the general case where the similarity measure satisfies a triangle inequality. Although it was shown that it achieves the fundamental limits for some cases, it is suboptimal in general. In this paper we propose a new scheme that also uses lossy compression algorithms as a building block, but with a carefully chosen distortion measure that is different than the one defining the similarity between sequences. The new scheme significantly improves the compression rate compared to the previously proposed scheme in many cases. For example, for binary sources and Hamming similarity measure, simulation results show a compression rate close to the fundamental limit, and an improvement over the previously proposed scheme of up to 55% (for the same reliability). The results shed light on the fact that compression for similarity identification is inherently different than classical lossy compression.
AB - We consider compression of sequences in a database so that similarity queries can be performed efficiently in the compressed domain. The fundamental limits for this problem setting, which characterize the trade off between compression rate and reliability of the answers to the queries, have been characterized in past work. However, how to approach these limits in practice has remained largely unexplored. Recently, we proposed a scheme for this task that is based on existing lossy compression algorithms, for the general case where the similarity measure satisfies a triangle inequality. Although it was shown that it achieves the fundamental limits for some cases, it is suboptimal in general. In this paper we propose a new scheme that also uses lossy compression algorithms as a building block, but with a carefully chosen distortion measure that is different than the one defining the similarity between sequences. The new scheme significantly improves the compression rate compared to the previously proposed scheme in many cases. For example, for binary sources and Hamming similarity measure, simulation results show a compression rate close to the fundamental limit, and an improvement over the previously proposed scheme of up to 55% (for the same reliability). The results shed light on the fact that compression for similarity identification is inherently different than classical lossy compression.
UR - http://www.scopus.com/inward/record.url?scp=84903469483&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84903469483&partnerID=8YFLogxK
U2 - 10.1109/DCC.2014.37
DO - 10.1109/DCC.2014.37
M3 - Conference contribution
AN - SCOPUS:84903469483
SN - 9781479938827
T3 - Data Compression Conference Proceedings
SP - 332
EP - 341
BT - Proceedings - DCC 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 Data Compression Conference, DCC 2014
Y2 - 26 March 2014 through 28 March 2014
ER -