Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance

Abhishek Verma, Ludmila Cherkasova, Roy H. Campbell

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Large-scale MapReduce clusters that routinely process petabytes of unstructured and semi-structured data represent a new entity in the changing landscape of clouds. A key challenge is to increase the utilization of these MapReduce clusters. In this work, we consider a subset of the production workload that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We offer a novel abstraction framework and a heuristic, called BalancedPools, that efficiently utilizes performance properties of MapReduce jobs in a given workload for constructing an optimized job schedule. Simulations performed over a realistic workload demonstrate that 15%-38% makespan improvements are achievable by simply processing the jobs in the right order.

Original languageEnglish (US)
Title of host publicationProceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012
Pages11-18
Number of pages8
DOIs
StatePublished - Nov 6 2012
Event2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012 - Washington, DC, United States
Duration: Aug 7 2012Aug 9 2012

Publication series

NameProceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012

Other

Other2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012
CountryUnited States
CityWashington, DC
Period8/7/128/9/12

Fingerprint

MapReduce
Schedule
Minimise
Workload
Processing
performance
Completion Time
workload
Semistructured Data
utilization
abstraction
Heuristics
heuristics
Resources
Subset
Demonstrate
simulation
Simulation
resources

Keywords

  • Hadoop
  • MapReduce
  • batch workloads
  • minimized makespan
  • optimized schedule

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Modeling and Simulation
  • Communication

Cite this

Verma, A., Cherkasova, L., & Campbell, R. H. (2012). Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance. In Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012 (pp. 11-18). [6298160] (Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012). https://doi.org/10.1109/MASCOTS.2012.12

Two sides of a coin : Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance. / Verma, Abhishek; Cherkasova, Ludmila; Campbell, Roy H.

Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012. 2012. p. 11-18 6298160 (Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Verma, A, Cherkasova, L & Campbell, RH 2012, Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance. in Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012., 6298160, Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012, pp. 11-18, 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012, Washington, DC, United States, 8/7/12. https://doi.org/10.1109/MASCOTS.2012.12
Verma A, Cherkasova L, Campbell RH. Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance. In Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012. 2012. p. 11-18. 6298160. (Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012). https://doi.org/10.1109/MASCOTS.2012.12
Verma, Abhishek ; Cherkasova, Ludmila ; Campbell, Roy H. / Two sides of a coin : Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance. Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012. 2012. pp. 11-18 (Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012).
@inproceedings{93205758950d4be3be1988ef26f0275c,
title = "Two sides of a coin: Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance",
abstract = "Large-scale MapReduce clusters that routinely process petabytes of unstructured and semi-structured data represent a new entity in the changing landscape of clouds. A key challenge is to increase the utilization of these MapReduce clusters. In this work, we consider a subset of the production workload that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We offer a novel abstraction framework and a heuristic, called BalancedPools, that efficiently utilizes performance properties of MapReduce jobs in a given workload for constructing an optimized job schedule. Simulations performed over a realistic workload demonstrate that 15{\%}-38{\%} makespan improvements are achievable by simply processing the jobs in the right order.",
keywords = "Hadoop, MapReduce, batch workloads, minimized makespan, optimized schedule",
author = "Abhishek Verma and Ludmila Cherkasova and Campbell, {Roy H.}",
year = "2012",
month = "11",
day = "6",
doi = "10.1109/MASCOTS.2012.12",
language = "English (US)",
isbn = "9780769547930",
series = "Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012",
pages = "11--18",
booktitle = "Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012",

}

TY - GEN

T1 - Two sides of a coin

T2 - Optimizing the schedule of mapreduce jobs to minimize their makespan and improve cluster performance

AU - Verma, Abhishek

AU - Cherkasova, Ludmila

AU - Campbell, Roy H.

PY - 2012/11/6

Y1 - 2012/11/6

N2 - Large-scale MapReduce clusters that routinely process petabytes of unstructured and semi-structured data represent a new entity in the changing landscape of clouds. A key challenge is to increase the utilization of these MapReduce clusters. In this work, we consider a subset of the production workload that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We offer a novel abstraction framework and a heuristic, called BalancedPools, that efficiently utilizes performance properties of MapReduce jobs in a given workload for constructing an optimized job schedule. Simulations performed over a realistic workload demonstrate that 15%-38% makespan improvements are achievable by simply processing the jobs in the right order.

AB - Large-scale MapReduce clusters that routinely process petabytes of unstructured and semi-structured data represent a new entity in the changing landscape of clouds. A key challenge is to increase the utilization of these MapReduce clusters. In this work, we consider a subset of the production workload that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We offer a novel abstraction framework and a heuristic, called BalancedPools, that efficiently utilizes performance properties of MapReduce jobs in a given workload for constructing an optimized job schedule. Simulations performed over a realistic workload demonstrate that 15%-38% makespan improvements are achievable by simply processing the jobs in the right order.

KW - Hadoop

KW - MapReduce

KW - batch workloads

KW - minimized makespan

KW - optimized schedule

UR - http://www.scopus.com/inward/record.url?scp=84868221206&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84868221206&partnerID=8YFLogxK

U2 - 10.1109/MASCOTS.2012.12

DO - 10.1109/MASCOTS.2012.12

M3 - Conference contribution

AN - SCOPUS:84868221206

SN - 9780769547930

T3 - Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012

SP - 11

EP - 18

BT - Proceedings of the 2012 IEEE 20th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, MASCOTS 2012

ER -