WANalytics: Analytics for a geo-distributed data-intensive world

Ashish Vulimiri, Carlo Curino, Brighten Godfrey, Konstantinos Karanasos, George Varghese

Research output: Contribution to conference, Paper

Abstract

Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed/mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big-Data (WABD). To the best of our knowledge, WABD is neither supported by currently deployed systems nor sufficiently studied in the literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations, and that centralized solutions incur substantial cross-data-center network costs. We argue that these trends will only worsen as the gap between data volumes and transoceanic bandwidth widens. Further, emerging concerns over data sovereignty and privacy may trigger government regulations that can threaten the very viability of centralized solutions. To address WABD we propose WANalytics, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed. Our Hadoop-based prototype delivers a 257× reduction in WAN bandwidth on a production workload from Microsoft. We round out our evaluation by also demonstrating substantial gains for three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.

Original language: English (US)
State: Published - Jan 1 2015
Event: 7th Biennial Conference on Innovative Data Systems Research, CIDR 2015 - Asilomar, United States
Duration: Jan 4 2015 to Jan 7 2015

Conference

Conference: 7th Biennial Conference on Innovative Data Systems Research, CIDR 2015
Country: United States
City: Asilomar
Period: 1/4/15 to 1/7/15

Fingerprint

Bandwidth
Copying
Wide area networks
Big data
Data center
Workload
Costs
Privacy
Trigger
Viability
Globe
Sovereignty
Prototype
Evaluation
Microsoft
Hadoop
Government regulation
Benchmark

ASJC Scopus subject areas

  • Information Systems and Management
  • Hardware and Architecture
  • Artificial Intelligence
  • Information Systems

Cite this

Vulimiri, A., Curino, C., Godfrey, B., Karanasos, K., & Varghese, G. (2015). WANalytics: Analytics for a geo-distributed data-intensive world. Paper presented at 7th Biennial Conference on Innovative Data Systems Research, CIDR 2015, Asilomar, United States.

@conference{b6ba704c89d74ef9963131cf54e0b79a,
title = "WANalytics: Analytics for a geo-distributed data-intensive world",
author = "Ashish Vulimiri and Carlo Curino and Brighten Godfrey and Konstantinos Karanasos and George Varghese",
year = "2015",
month = "1",
day = "1",
language = "English (US)",
note = "7th Biennial Conference on Innovative Data Systems Research, CIDR 2015 ; Conference date: 04-01-2015 Through 07-01-2015",

}

TY - CONF

T1 - WANalytics

T2 - Analytics for a geo-distributed data-intensive world

AU - Vulimiri, Ashish

AU - Curino, Carlo

AU - Godfrey, Brighten

AU - Karanasos, Konstantinos

AU - Varghese, George

PY - 2015/1/1

Y1 - 2015/1/1

UR - http://www.scopus.com/inward/record.url?scp=85050991594&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85050991594&partnerID=8YFLogxK

M3 - Paper

AN - SCOPUS:85050991594

ER -