SQUISH: Near-optimal compression for archival of relational datasets

Yihan Gao, Aditya Parameswaran

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Relational datasets are being generated at an alarmingly rapid rate across organizations and industries. Compressing these datasets could significantly reduce storage and archival costs. Traditional compression algorithms, e.g., gzip, are suboptimal for compressing relational datasets since they ignore the table structure and relationships between attributes. We study compression algorithms that leverage the relational structure to compress datasets to a much greater extent. We develop SQUISH, a system that uses a combination of Bayesian Networks and Arithmetic Coding to capture multiple kinds of dependencies among attributes and achieve near-entropy compression rate. SQUISH also supports user-defined attributes: users can instantiate new data types by simply implementing five functions for a new class interface. We prove the asymptotic optimality of our compression algorithm and conduct experiments to show the effectiveness of our system: SQUISH achieves a reduction of over 50% in storage size relative to systems developed in prior work on a variety of real datasets.

Original languageEnglish (US)
Title of host publicationKDD 2016 - Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages1575-1584
Number of pages10
ISBN (Electronic)9781450342322
DOIs
StatePublished - Aug 13 2016
Event22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016 - San Francisco, United States
Duration: Aug 13 2016Aug 17 2016

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
Volume13-17-August-2016

Other

Other22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD 2016
CountryUnited States
CitySan Francisco
Period8/13/168/17/16

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint Dive into the research topics of 'SQUISH: Near-optimal compression for archival of relational datasets'. Together they form a unique fingerprint.

  • Cite this

    Gao, Y., & Parameswaran, A. (2016). SQUISH: Near-optimal compression for archival of relational datasets. In KDD 2016 - Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 1575-1584). (Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; Vol. 13-17-August-2016). Association for Computing Machinery. https://doi.org/10.1145/2939672.2939867