TY - GEN
T1 - Predictive data and energy management in GreenHDFS
AU - Kaushik, Rini T.
AU - Abdelzaher, Tarek
AU - Egashira, Ryota
AU - Nahrstedt, Klara
PY - 2011
Y1 - 2011
N2 - The sheer scale and rapid rise of Big Data mandate highly scalable, self-adaptive, and energy-conserving data-intensive compute clusters. Based on our analysis of the traces from a production Hadoop cluster at Yahoo!, we observe that file size, file lifespan, and file heat are statistically correlated and very strongly associated with the hierarchical directory structure (i.e., absolute file path) in which the files are organized. Leveraging that observation, we present predictive GreenHDFS, an energy-conserving variant of the Hadoop distributed file system that uses a supervised machine learning technique to learn the correlation between the directory hierarchy and the file attributes to guide novel predictive file zone placement, migration, and replication policies that significantly outperform the current state-of-the-art reactive approaches. Using real-world traces from a large-scale (2600 servers, 5 petabytes) production Hadoop cluster at Yahoo! in our GreenHDFS simulations, we show how predictive GreenHDFS results in a much better trade-off between performance and energy consumption.
AB - The sheer scale and rapid rise of Big Data mandate highly scalable, self-adaptive, and energy-conserving data-intensive compute clusters. Based on our analysis of the traces from a production Hadoop cluster at Yahoo!, we observe that file size, file lifespan, and file heat are statistically correlated and very strongly associated with the hierarchical directory structure (i.e., absolute file path) in which the files are organized. Leveraging that observation, we present predictive GreenHDFS, an energy-conserving variant of the Hadoop distributed file system that uses a supervised machine learning technique to learn the correlation between the directory hierarchy and the file attributes to guide novel predictive file zone placement, migration, and replication policies that significantly outperform the current state-of-the-art reactive approaches. Using real-world traces from a large-scale (2600 servers, 5 petabytes) production Hadoop cluster at Yahoo! in our GreenHDFS simulations, we show how predictive GreenHDFS results in a much better trade-off between performance and energy consumption.
UR - http://www.scopus.com/inward/record.url?scp=80053218141&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80053218141&partnerID=8YFLogxK
U2 - 10.1109/IGCC.2011.6008563
DO - 10.1109/IGCC.2011.6008563
M3 - Conference contribution
AN - SCOPUS:80053218141
SN - 9781457712203
T3 - 2011 International Green Computing Conference and Workshops, IGCC 2011
BT - 2011 International Green Computing Conference and Workshops, IGCC 2011
T2 - 2011 International Green Computing Conference, IGCC 2011
Y2 - 25 July 2011 through 28 July 2011
ER -