TY - JOUR
T1 - Approximate selection with guarantees using proxies
AU - Kang, Daniel
AU - Gan, Edward
AU - Bailis, Peter
AU - Hashimoto, Tatsunori
AU - Zaharia, Matei
N1 - Funding Information:
We thank Sahaana Suri, Kexin Rong, and members of the Stanford Infolab for their feedback on early drafts. We further thank Tadashi Fukami, Trevor Hebert, and IsaacWestlund for their helpful discussions. The hummingbird data was collected by Kaoru Tsuji, Trevor Hebert, and Tadashi Fukami, funded by a Kyoto University Foundation grant and an NSF Dimensions of Biodiversity grant (DEB-1737758). This research was supported in part by affiliate members and other supporters of the Stanford DAWN project-Ant Financial, Facebook, Google, Infosys, NEC, and VMware-as well as Toyota Research Institute, Northrop Grumman, Amazon Web Services, Cisco, and the NSF under CAREER grant CNS-1651570. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the NSF. Toyota Research Institute ("TRI") provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.
Publisher Copyright:
© 2020, VLDB Endowment.
PY - 2020/7/1
Y1 - 2020/7/1
N2 - Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect long videos to identify relevant objects and events. To reduce this cost, recent work proposes to use cheap proxy models, such as image classifiers, to identify an approximate set of data points satisfying a data selection filter. Unfortunately, this recent work does not provide the statistical accuracy guarantees necessary in scientific and production settings. In this work, we introduce novel algorithms for approximate selection queries with statistical accuracy guarantees. Namely, given a limited number of exact identifications from an oracle, often a human or an expensive machine learning model, our algorithms meet a minimum precision or recall target with high probability. In contrast, existing approaches can catastrophically fail in satisfying these recall and precision targets. We show that our algorithms can improve query result quality by up to 30 x for both the precision and recall targets in both real and synthetic datasets.
AB - Due to the falling costs of data acquisition and storage, researchers and industry analysts often want to find all instances of rare events in large datasets. For instance, scientists can cheaply capture thousands of hours of video, but are limited by the need to manually inspect long videos to identify relevant objects and events. To reduce this cost, recent work proposes to use cheap proxy models, such as image classifiers, to identify an approximate set of data points satisfying a data selection filter. Unfortunately, this recent work does not provide the statistical accuracy guarantees necessary in scientific and production settings. In this work, we introduce novel algorithms for approximate selection queries with statistical accuracy guarantees. Namely, given a limited number of exact identifications from an oracle, often a human or an expensive machine learning model, our algorithms meet a minimum precision or recall target with high probability. In contrast, existing approaches can catastrophically fail in satisfying these recall and precision targets. We show that our algorithms can improve query result quality by up to 30 x for both the precision and recall targets in both real and synthetic datasets.
UR - http://www.scopus.com/inward/record.url?scp=85091117927&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85091117927&partnerID=8YFLogxK
U2 - 10.14778/3407790.3407804
DO - 10.14778/3407790.3407804
M3 - Article
AN - SCOPUS:85091117927
SN - 2150-8097
VL - 13
SP - 1990
EP - 2003
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 11
ER -