A storage-centric analysis of MapReduce workloads: File popularity, temporal locality and arrival patterns

Cristina L. Abad, Nathan Roberts, Yi Lu, Roy H. Campbell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

A huge increase in data storage and processing requirements has led to Big Data, for which next-generation storage systems are being designed and implemented. However, we have a limited understanding of the workloads of Big Data storage systems. We consider the case of one common type of Big Data storage cluster: a cluster dedicated to supporting a mix of MapReduce jobs. We analyze 6-month traces from two large Hadoop clusters at Yahoo! and characterize the file popularity, temporal locality, and arrival patterns of the workloads. We identify several interesting properties and compare them with previous observations from web and media server workloads. To the best of our knowledge, this is the first study of how MapReduce workloads interact with the storage layer.

Original language: English (US)
Title of host publication: Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012
Publisher: IEEE Computer Society
Pages: 100-109
Number of pages: 10
ISBN (Print): 9781457720642
DOI: https://doi.org/10.1109/IISWC.2012.6402909
State: Published - Jan 1 2012
Event: 2012 IEEE International Symposium on Workload Characterization, IISWC 2012 - San Diego, CA, United States
Duration: Nov 4 2012 → Nov 6 2012

Publication series

Name: Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012

Other

Other: 2012 IEEE International Symposium on Workload Characterization, IISWC 2012
Country: United States
City: San Diego, CA
Period: 11/4/12 → 11/6/12

Fingerprint

Servers
Data storage equipment
Processing
Big data

Keywords

  • Access patterns
  • Big Data
  • HDFS
  • MapReduce

ASJC Scopus subject areas

  • Electrical and Electronic Engineering

Cite this

Abad, C. L., Roberts, N., Lu, Y., & Campbell, R. H. (2012). A storage-centric analysis of MapReduce workloads: File popularity, temporal locality and arrival patterns. In Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012 (pp. 100-109). [6402909] (Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012). IEEE Computer Society. https://doi.org/10.1109/IISWC.2012.6402909


Abad, CL, Roberts, N, Lu, Y & Campbell, RH 2012, A storage-centric analysis of MapReduce workloads: File popularity, temporal locality and arrival patterns. in Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012., 6402909, Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012, IEEE Computer Society, pp. 100-109, 2012 IEEE International Symposium on Workload Characterization, IISWC 2012, San Diego, CA, United States, 11/4/12. https://doi.org/10.1109/IISWC.2012.6402909
Abad CL, Roberts N, Lu Y, Campbell RH. A storage-centric analysis of MapReduce workloads: File popularity, temporal locality and arrival patterns. In Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012. IEEE Computer Society. 2012. p. 100-109. 6402909. (Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012). https://doi.org/10.1109/IISWC.2012.6402909
Abad, Cristina L. ; Roberts, Nathan ; Lu, Yi ; Campbell, Roy H. / A storage-centric analysis of MapReduce workloads : File popularity, temporal locality and arrival patterns. Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012. IEEE Computer Society, 2012. pp. 100-109 (Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012).
@inproceedings{0bad337b4a1442c687bd84207eb96604,
title = "A storage-centric analysis of MapReduce workloads: File popularity, temporal locality and arrival patterns",
abstract = "A huge increase in data storage and processing requirements has led to Big Data, for which next-generation storage systems are being designed and implemented. However, we have a limited understanding of the workloads of Big Data storage systems. We consider the case of one common type of Big Data storage cluster: a cluster dedicated to supporting a mix of MapReduce jobs. We analyze 6-month traces from two large Hadoop clusters at Yahoo! and characterize the file popularity, temporal locality, and arrival patterns of the workloads. We identify several interesting properties and compare them with previous observations from web and media server workloads. To the best of our knowledge, this is the first study of how MapReduce workloads interact with the storage layer.",
keywords = "Access patterns, Big Data, HDFS, MapReduce",
author = "Abad, {Cristina L.} and Nathan Roberts and Yi Lu and Campbell, {Roy H.}",
year = "2012",
month = "1",
day = "1",
doi = "10.1109/IISWC.2012.6402909",
language = "English (US)",
isbn = "9781457720642",
series = "Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012",
publisher = "IEEE Computer Society",
pages = "100--109",
booktitle = "Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012",

}

TY - GEN

T1 - A storage-centric analysis of MapReduce workloads

T2 - File popularity, temporal locality and arrival patterns

AU - Abad, Cristina L.

AU - Roberts, Nathan

AU - Lu, Yi

AU - Campbell, Roy H.

PY - 2012/1/1

Y1 - 2012/1/1

N2 - A huge increase in data storage and processing requirements has led to Big Data, for which next-generation storage systems are being designed and implemented. However, we have a limited understanding of the workloads of Big Data storage systems. We consider the case of one common type of Big Data storage cluster: a cluster dedicated to supporting a mix of MapReduce jobs. We analyze 6-month traces from two large Hadoop clusters at Yahoo! and characterize the file popularity, temporal locality, and arrival patterns of the workloads. We identify several interesting properties and compare them with previous observations from web and media server workloads. To the best of our knowledge, this is the first study of how MapReduce workloads interact with the storage layer.

AB - A huge increase in data storage and processing requirements has led to Big Data, for which next-generation storage systems are being designed and implemented. However, we have a limited understanding of the workloads of Big Data storage systems. We consider the case of one common type of Big Data storage cluster: a cluster dedicated to supporting a mix of MapReduce jobs. We analyze 6-month traces from two large Hadoop clusters at Yahoo! and characterize the file popularity, temporal locality, and arrival patterns of the workloads. We identify several interesting properties and compare them with previous observations from web and media server workloads. To the best of our knowledge, this is the first study of how MapReduce workloads interact with the storage layer.

KW - Access patterns

KW - Big Data

KW - HDFS

KW - MapReduce

UR - http://www.scopus.com/inward/record.url?scp=84873453654&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84873453654&partnerID=8YFLogxK

U2 - 10.1109/IISWC.2012.6402909

DO - 10.1109/IISWC.2012.6402909

M3 - Conference contribution

AN - SCOPUS:84873453654

SN - 9781457720642

T3 - Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012

SP - 100

EP - 109

BT - Proceedings - 2012 IEEE International Symposium on Workload Characterization, IISWC 2012

PB - IEEE Computer Society

ER -