TY - JOUR
T1 - Accelerating approximate aggregation queries with expensive predicates
AU - Kang, Daniel
AU - Guibas, John
AU - Bailis, Peter
AU - Hashimoto, Tatsunori
AU - Sun, Yi
AU - Zaharia, Matei
N1 - Funding Information:
This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Infosys, NEC, and VMware—as well as Toyota Research Institute (“TRI”), Northrop Grumman, Amazon Web Services, Cisco, and the NSF under CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. TRI provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
Funding Information:
This research was supported in part by affiliate members and other supporters of the Stanford DAWN project—Ant Financial, Facebook, Google, Infosys, NEC, and VMware—as well as Toyota Research Institute (“TRI”), Northrop Grumman, AmazonWeb Services, Cisco, and the NSF under CAREER grant CNS-1651570. Any opinions, -findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. TRI provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
Publisher Copyright:
© 2021, VLDB Endowment. All rights reserved.
PY - 2021
Y1 - 2021
N2 - Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard approximate query processing techniques to accelerate such queries are not applicable because they assume the result of the predicates are available ahead of time. Furthermore, recent work using cheap approximations (i.e., proxies) do not support aggregation queries with predicates. To accelerate aggregation queries with expensive predicates, we develop and analyze a query processing algorithm that leverages proxies (ABae). ABae must account for the key challenge that it may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, ABae uses pilot sampling and plugin estimates to sample according to the optimal allocation. We show that ABae converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABae outperforms on baselines on six real-world datasets, reducing labeling costs by up to 2.3×.
AB - Researchers and industry analysts are increasingly interested in computing aggregation queries over large, unstructured datasets with selective predicates that are computed using expensive deep neural networks (DNNs). As these DNNs are expensive and because many applications can tolerate approximate answers, analysts are interested in accelerating these queries via approximations. Unfortunately, standard approximate query processing techniques to accelerate such queries are not applicable because they assume the result of the predicates are available ahead of time. Furthermore, recent work using cheap approximations (i.e., proxies) do not support aggregation queries with predicates. To accelerate aggregation queries with expensive predicates, we develop and analyze a query processing algorithm that leverages proxies (ABae). ABae must account for the key challenge that it may sample records that do not satisfy the predicate. To address this challenge, we first use the proxy to group records into strata so that records satisfying the predicate are ideally grouped into few strata. Given these strata, ABae uses pilot sampling and plugin estimates to sample according to the optimal allocation. We show that ABae converges at an optimal rate in a novel analysis of stratified sampling with draws that may not satisfy the predicate. We further show that ABae outperforms on baselines on six real-world datasets, reducing labeling costs by up to 2.3×.
UR - http://www.scopus.com/inward/record.url?scp=85119694328&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85119694328&partnerID=8YFLogxK
U2 - 10.14778/3476249.3476285
DO - 10.14778/3476249.3476285
M3 - Conference article
AN - SCOPUS:85119694328
SN - 2150-8097
VL - 14
SP - 2341
EP - 2354
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 11
T2 - 47th International Conference on Very Large Data Bases, VLDB 2021
Y2 - 16 August 2021 through 20 August 2021
ER -