Predictive data and energy management in GreenHDFS

Rini T. Kaushik, Tarek Abdelzaher, Ryota Egashira, Klara Nahrstedt

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

The sheer scale and rapid rise of Big Data mandate highly scalable, self-adaptive, and energy-conserving data-intensive compute clusters. Based on our analysis of traces from a production Hadoop cluster at Yahoo!, we observe that file size, file lifespan, and file heat are statistically correlated and very strongly associated with the hierarchical directory structure (i.e., absolute file path) in which the files are organized. Leveraging that observation, we present predictive GreenHDFS, an energy-conserving variant of the Hadoop distributed file system that uses a supervised machine learning technique to learn the correlation between the directory hierarchy and the file attributes. The learned correlation guides novel predictive file zone placement, migration, and replication policies that significantly outperform the current state-of-the-art reactive approaches. Using real-world traces from a large-scale (2600 servers, 5 petabytes) production Hadoop cluster at Yahoo! in our GreenHDFS simulations, we show that predictive GreenHDFS achieves a much better trade-off between performance and energy consumption.

Original language: English (US)
Title of host publication: 2011 International Green Computing Conference and Workshops, IGCC 2011
DOIs: https://doi.org/10.1109/IGCC.2011.6008563
State: Published - Sep 30 2011
Event: 2011 International Green Computing Conference, IGCC 2011 - Orlando, FL, United States
Duration: Jul 25 2011 to Jul 28 2011

Publication series

Name: 2011 International Green Computing Conference and Workshops, IGCC 2011

Other

Other: 2011 International Green Computing Conference, IGCC 2011
Country: United States
City: Orlando, FL
Period: 7/25/11 to 7/28/11

Fingerprint

Energy management
Information management
Learning systems
Servers
Energy utilization
Big data
Hot Temperature

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Environmental Engineering

Cite this

Kaushik, R. T., Abdelzaher, T., Egashira, R., & Nahrstedt, K. (2011). Predictive data and energy management in GreenHDFS. In 2011 International Green Computing Conference and Workshops, IGCC 2011 [6008563] (2011 International Green Computing Conference and Workshops, IGCC 2011). https://doi.org/10.1109/IGCC.2011.6008563

Predictive data and energy management in GreenHDFS. / Kaushik, Rini T.; Abdelzaher, Tarek; Egashira, Ryota; Nahrstedt, Klara.

2011 International Green Computing Conference and Workshops, IGCC 2011. 2011. 6008563 (2011 International Green Computing Conference and Workshops, IGCC 2011).

Kaushik, RT, Abdelzaher, T, Egashira, R & Nahrstedt, K 2011, Predictive data and energy management in GreenHDFS. in 2011 International Green Computing Conference and Workshops, IGCC 2011., 6008563, 2011 International Green Computing Conference and Workshops, IGCC 2011, 2011 International Green Computing Conference, IGCC 2011, Orlando, FL, United States, 7/25/11. https://doi.org/10.1109/IGCC.2011.6008563
Kaushik RT, Abdelzaher T, Egashira R, Nahrstedt K. Predictive data and energy management in GreenHDFS. In 2011 International Green Computing Conference and Workshops, IGCC 2011. 2011. 6008563. (2011 International Green Computing Conference and Workshops, IGCC 2011). https://doi.org/10.1109/IGCC.2011.6008563
Kaushik, Rini T. ; Abdelzaher, Tarek ; Egashira, Ryota ; Nahrstedt, Klara. / Predictive data and energy management in GreenHDFS. 2011 International Green Computing Conference and Workshops, IGCC 2011. 2011. (2011 International Green Computing Conference and Workshops, IGCC 2011).
@inproceedings{ad307e0748cd42948e1c00986476f0d3,
title = "Predictive data and energy management in GreenHDFS",
abstract = "The sheer scale and rapid rise of Big Data mandate highly scalable, self-adaptive, and energy-conserving data-intensive compute clusters. Based on our analysis of traces from a production Hadoop cluster at Yahoo!, we observe that file size, file lifespan, and file heat are statistically correlated and very strongly associated with the hierarchical directory structure (i.e., absolute file path) in which the files are organized. Leveraging that observation, we present predictive GreenHDFS, an energy-conserving variant of the Hadoop distributed file system that uses a supervised machine learning technique to learn the correlation between the directory hierarchy and the file attributes. The learned correlation guides novel predictive file zone placement, migration, and replication policies that significantly outperform the current state-of-the-art reactive approaches. Using real-world traces from a large-scale (2600 servers, 5 petabytes) production Hadoop cluster at Yahoo! in our GreenHDFS simulations, we show that predictive GreenHDFS achieves a much better trade-off between performance and energy consumption.",
author = "Kaushik, {Rini T.} and Tarek Abdelzaher and Ryota Egashira and Klara Nahrstedt",
year = "2011",
month = "9",
day = "30",
doi = "10.1109/IGCC.2011.6008563",
language = "English (US)",
isbn = "9781457712203",
series = "2011 International Green Computing Conference and Workshops, IGCC 2011",
booktitle = "2011 International Green Computing Conference and Workshops, IGCC 2011",

}

TY - GEN

T1 - Predictive data and energy management in GreenHDFS

AU - Kaushik, Rini T.

AU - Abdelzaher, Tarek

AU - Egashira, Ryota

AU - Nahrstedt, Klara

PY - 2011/9/30

Y1 - 2011/9/30

N2 - The sheer scale and rapid rise of Big Data mandate highly scalable, self-adaptive, and energy-conserving data-intensive compute clusters. Based on our analysis of traces from a production Hadoop cluster at Yahoo!, we observe that file size, file lifespan, and file heat are statistically correlated and very strongly associated with the hierarchical directory structure (i.e., absolute file path) in which the files are organized. Leveraging that observation, we present predictive GreenHDFS, an energy-conserving variant of the Hadoop distributed file system that uses a supervised machine learning technique to learn the correlation between the directory hierarchy and the file attributes. The learned correlation guides novel predictive file zone placement, migration, and replication policies that significantly outperform the current state-of-the-art reactive approaches. Using real-world traces from a large-scale (2600 servers, 5 petabytes) production Hadoop cluster at Yahoo! in our GreenHDFS simulations, we show that predictive GreenHDFS achieves a much better trade-off between performance and energy consumption.

UR - http://www.scopus.com/inward/record.url?scp=80053218141&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80053218141&partnerID=8YFLogxK

U2 - 10.1109/IGCC.2011.6008563

DO - 10.1109/IGCC.2011.6008563

M3 - Conference contribution

AN - SCOPUS:80053218141

SN - 9781457712203

T3 - 2011 International Green Computing Conference and Workshops, IGCC 2011

BT - 2011 International Green Computing Conference and Workshops, IGCC 2011

ER -