TY - GEN
T1 - Database Learning
T2 - 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
AU - Park, Yongjoo
AU - Tajik, Ahmad Shahab
AU - Cafarella, Michael
AU - Mozafari, Barzan
N1 - Publisher Copyright: © 2017 ACM. Copyright 2018 Elsevier B.V., All rights reserved.
PY - 2017/5/9
Y1 - 2017/5/9
N2 - In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea, learning from past query answers, Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0× for the same accuracy level compared to existing AQP systems.
AB - In today's databases, previous query answers rarely benefit answering future queries. For the first time, to the best of our knowledge, we change this paradigm in an approximate query processing (AQP) context. We make the following observation: the answer to each query reveals some degree of knowledge about the answer to another query because their answers stem from the same underlying distribution that has produced the entire dataset. Exploiting and refining this knowledge should allow us to answer queries more analytically, rather than by reading enormous amounts of raw data. Also, processing more queries should continuously enhance our knowledge of the underlying distribution, and hence lead to increasingly faster response times for future queries. We call this novel idea, learning from past query answers, Database Learning. We exploit the principle of maximum entropy to produce answers, which are in expectation guaranteed to be more accurate than existing sample-based approximations. Empowered by this idea, we build a query engine on top of Spark SQL, called Verdict. We conduct extensive experiments on real-world query traces from a large customer of a major database vendor. Our results demonstrate that Verdict supports 73.7% of these queries, speeding them up by up to 23.0× for the same accuracy level compared to existing AQP systems.
UR - http://www.scopus.com/inward/record.url?scp=85021223729&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85021223729&partnerID=8YFLogxK
U2 - 10.1145/3035918.3064013
DO - 10.1145/3035918.3064013
M3 - Conference contribution
AN - SCOPUS:85021223729
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 587
EP - 602
BT - SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
PB - Association for Computing Machinery
Y2 - 14 May 2017 through 19 May 2017
ER -