TY - GEN
T1 - Global analytics in the face of bandwidth and regulatory constraints
AU - Vulimiri, Ashish
AU - Curino, Carlo
AU - Godfrey, Brighten
AU - Jungblut, Thomas
AU - Padhye, Jitu
AU - Varghese, George
N1 - Publisher Copyright:
© 2015 by The USENIX Association. All Rights Reserved.
PY - 2015
Y1 - 2015
N2 - Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists in orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constaints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting such as pseudo-distributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, Geode, builds upon Hive and uses 250× less bandwidth than centralized analytics in a Microsoft production workload and up to 360× less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. Geode supports all SQL operators, including Joins, across global data.
AB - Global-scale organizations produce large volumes of data across geographically distributed data centers. Querying and analyzing such data as a whole introduces new research issues at the intersection of networks and databases. Today systems that compute SQL analytics over geographically distributed data operate by pulling all data to a central location. This is problematic at large data scales due to expensive transoceanic links, and may be rendered impossible by emerging regulatory constraints. The new problem of Wide-Area Big Data (WABD) consists in orchestrating query execution across data centers to minimize bandwidth while respecting regulatory constaints. WABD combines classical query planning with novel network-centric mechanisms designed for a wide-area setting such as pseudo-distributed execution, joint query optimization, and deltas on cached subquery results. Our prototype, Geode, builds upon Hive and uses 250× less bandwidth than centralized analytics in a Microsoft production workload and up to 360× less on popular analytics benchmarks including TPC-CH and Berkeley Big Data. Geode supports all SQL operators, including Joins, across global data.
UR - http://www.scopus.com/inward/record.url?scp=84967163393&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84967163393&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:84967163393
T3 - Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015
SP - 323
EP - 336
BT - Proceedings of the 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015
PB - USENIX
T2 - 12th USENIX Symposium on Networked Systems Design and Implementation, NSDI 2015
Y2 - 4 May 2015 through 6 May 2015
ER -