TY - GEN
T1 - QuickSel: Quick Selectivity Learning with Mixture Models
T2 - 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD 2020
AU - Park, Yongjoo
AU - Zhong, Shucheng
AU - Mozafari, Barzan
N1 - Funding Information:
This material is based upon work supported by the National Science Foundation under Grant No. 1629397 and the Michigan Institute for Data Science (MIDAS) PODS. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Publisher Copyright:
© 2020 Association for Computing Machinery.
PY - 2020/6/14
Y1 - 2020/6/14
AB - Estimating the selectivity of a query is a key step in almost any cost-based query optimizer. Most of today's databases rely on histograms or samples that are periodically refreshed by re-scanning the data as the underlying data changes. Since frequent scans are costly, these statistics are often stale and lead to poor selectivity estimates. As an alternative to scans, query-driven histograms have been proposed, which refine the histograms based on the actual selectivities of the observed queries. Unfortunately, these approaches are either too costly to use in practice (i.e., require an exponential number of buckets) or quickly lose their advantage as they observe more queries. In this paper, we propose a selectivity learning framework, called QuickSel, which falls into the query-driven paradigm but does not use histograms. Instead, it builds an internal model of the underlying data, which can be refined significantly faster (e.g., only 1.9 milliseconds for 300 queries). This fast refinement allows QuickSel to continuously learn from each query and yield increasingly more accurate selectivity estimates over time. Unlike query-driven histograms, QuickSel relies on a mixture model and a new optimization algorithm for training its model. Our extensive experiments on two real-world datasets confirm that, given the same target accuracy, QuickSel is 34.0x–179.4x faster than state-of-the-art query-driven histograms, including ISOMER and STHoles. Further, given the same space budget, QuickSel is 26.8%–91.8% more accurate than periodically-updated histograms and samples, respectively.
KW - approximate query processing
KW - cardinality estimation
KW - database learning
KW - selectivity estimation
KW - selectivity learning
UR - https://www.scopus.com/pages/publications/85086245467
UR - https://www.scopus.com/pages/publications/85086245467#tab=citedBy
U2 - 10.1145/3318464.3389727
DO - 10.1145/3318464.3389727
M3 - Conference contribution
AN - SCOPUS:85086245467
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1017
EP - 1033
BT - SIGMOD 2020 - Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 14 June 2020 through 19 June 2020
ER -