TY - GEN
T1 - Sampling cube
T2 - 2008 ACM SIGMOD International Conference on Management of Data, SIGMOD'08
AU - Li, Xiaolei
AU - Han, Jiawei
AU - Yin, Zhijun
AU - Lee, Jae Gil
AU - Sun, Yizhou
PY - 2008
Y1 - 2008
AB - Sampling is a popular method of data collection when it is impossible or too costly to reach the entire population. For example, television show ratings in the United States are gathered from a sample of roughly 5,000 households. To use the results effectively, the samples are further partitioned in a multidimensional space based on multiple attribute values. This naturally leads to the desirability of OLAP (Online Analytical Processing) over sampling data. However, unlike traditional data, sampling data is inherently uncertain, i.e., not representing the full data in the population. Thus, it is desirable to return not only query results but also the confidence intervals indicating the reliability of the results. Moreover, a certain segment in a multidimensional space may contain none or too few samples. This requires some additional analysis to return trustable results. In this paper we propose a Sampling Cube framework, which efficiently calculates confidence intervals for any multidimensional query and uses the OLAP structure to group similar segments to increase sampling size when needed. Further, to handle high dimensional data, a Sampling Cube Shell method is proposed to effectively reduce the storage requirement while still preserving query result quality.
KW - Algorithms
KW - Design
KW - Experimentation
UR - http://www.scopus.com/inward/record.url?scp=57149135463&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=57149135463&partnerID=8YFLogxK
U2 - 10.1145/1376616.1376695
DO - 10.1145/1376616.1376695
M3 - Conference contribution
AN - SCOPUS:57149135463
SN - 9781605581026
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 779
EP - 790
BT - SIGMOD 2008
Y2 - 9 June 2008 through 12 June 2008
ER -