Sampling cube: A framework for statistical OLAP over sampling data

Xiaolei Li, Jiawei Han, Zhijun Yin, Jae Gil Lee, Yizhou Sun

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Sampling is a popular method of data collection when it is impossible or too costly to reach the entire population. For example, television show ratings in the United States are gathered from a sample of roughly 5,000 households. To use the results effectively, the samples are further partitioned in a multidimensional space based on multiple attribute values. This naturally leads to the desirability of OLAP (Online Analytical Processing) over sampling data. However, unlike traditional data, sampling data is inherently uncertain, i.e., not representing the full data in the population. Thus, it is desirable to return not only query results but also the confidence intervals indicating the reliability of the results. Moreover, a certain segment in a multidimensional space may contain none or too few samples. This requires some additional analysis to return trustable results. In this paper we propose a Sampling Cube framework, which efficiently calculates confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase sampling size when needed. Further, to handle high dimensional data, a Sampling Cube Shell method is proposed to effectively reduce the storage requirement while still preserving query result quality.

Original languageEnglish (US)
Title of host publicationSIGMOD 2008
Subtitle of host publicationProceedings of the ACM SIGMOD International Conference on Management of Data 2008
Pages779-790
Number of pages12
DOIs
StatePublished - 2008
Event2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08 - Vancouver, BC, Canada
Duration: Jun 9 2008Jun 12 2008

Publication series

NameProceedings of the ACM SIGMOD International Conference on Management of Data
ISSN (Print)0730-8078

Other

Other2008 ACM SIGMOD International Conference on Management of Data 2008, SIGMOD'08
Country/TerritoryCanada
CityVancouver, BC
Period6/9/086/12/08

Keywords

  • Algorithms
  • Design
  • Experimentation

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint

Dive into the research topics of 'Sampling cube: A framework for statistical OLAP over sampling data'. Together they form a unique fingerprint.

Cite this