Orchestrating an ensemble of mapreduce jobs for minimizing their makespan

Abhishek Verma, Ludmila Cherkasova, Roy H. Campbell

Research output: Contribution to journalArticle

Abstract

Cloud computing offers an attractive option for businesses to rent a suitable size MapReduce cluster, consume resources as a service, and pay only for resources that were consumed. A key challenge in such environments is to increase the utilization of MapReduce clusters to minimize their cost. One way of achieving this goal is to optimize the execution of Mapreduce jobs on the cluster. For a set of production jobs that are executed periodically on new data, we can perform an offline analysis for evaluating performance benefits of different optimization techniques. In this work, we consider a subset of production workloads that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We introduce a simple abstraction where each MapReduce job is represented as a pair of map and reduce stage durations. This representation enables us to apply the classic Johnson algorithm that was designed for building an optimal two-stage job schedule. We evaluate the performance benefits of the constructed schedule through an extensive set of simulations over a variety of realistic workloads. The results are workload and cluster-size dependent, but it is typical to achieve up to 10-25 percent of makespan improvements by simply processing the jobs in the right order. However, in some cases, the simplified abstraction assumed by Johnson's algorithm may lead to a suboptimal job schedule. We design a novel heuristic, called BalancedPools, that significantly improves Johnson's schedule results (up to 15-38 percent), exactly in the situations when it produces suboptimal makespan. Overall, we observe up to 50 percent in the makespan improvements with the new BalancedPools algorithm. The results of our simulation study are validated through experiments on a 66-node Hadoop cluster.

Original languageEnglish (US)
Article number6461893
Pages (from-to)314-327
Number of pages14
JournalIEEE Transactions on Dependable and Secure Computing
Volume10
Issue number5
DOIs
StatePublished - Sep 17 2013

Fingerprint

Cloud computing
Processing
Costs
Industry
Experiments

Keywords

  • Batch workloads
  • Hadoop
  • Map reduce
  • Minimized makespan
  • Optimized schedule

ASJC Scopus subject areas

  • Electrical and Electronic Engineering

Cite this

Orchestrating an ensemble of mapreduce jobs for minimizing their makespan. / Verma, Abhishek; Cherkasova, Ludmila; Campbell, Roy H.

In: IEEE Transactions on Dependable and Secure Computing, Vol. 10, No. 5, 6461893, 17.09.2013, p. 314-327.

Research output: Contribution to journalArticle

@article{e779a73eb48b40bd99c76beda2937d66,
title = "Orchestrating an ensemble of mapreduce jobs for minimizing their makespan",
abstract = "Cloud computing offers an attractive option for businesses to rent a suitable size MapReduce cluster, consume resources as a service, and pay only for resources that were consumed. A key challenge in such environments is to increase the utilization of MapReduce clusters to minimize their cost. One way of achieving this goal is to optimize the execution of Mapreduce jobs on the cluster. For a set of production jobs that are executed periodically on new data, we can perform an offline analysis for evaluating performance benefits of different optimization techniques. In this work, we consider a subset of production workloads that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We introduce a simple abstraction where each MapReduce job is represented as a pair of map and reduce stage durations. This representation enables us to apply the classic Johnson algorithm that was designed for building an optimal two-stage job schedule. We evaluate the performance benefits of the constructed schedule through an extensive set of simulations over a variety of realistic workloads. The results are workload and cluster-size dependent, but it is typical to achieve up to 10-25 percent of makespan improvements by simply processing the jobs in the right order. However, in some cases, the simplified abstraction assumed by Johnson's algorithm may lead to a suboptimal job schedule. We design a novel heuristic, called BalancedPools, that significantly improves Johnson's schedule results (up to 15-38 percent), exactly in the situations when it produces suboptimal makespan. Overall, we observe up to 50 percent in the makespan improvements with the new BalancedPools algorithm. The results of our simulation study are validated through experiments on a 66-node Hadoop cluster.",
keywords = "Batch workloads, Hadoop, Map reduce, Minimized makespan, Optimized schedule",
author = "Abhishek Verma and Ludmila Cherkasova and Campbell, {Roy H.}",
year = "2013",
month = "9",
day = "17",
doi = "10.1109/TDSC.2013.14",
language = "English (US)",
volume = "10",
pages = "314--327",
journal = "IEEE Transactions on Dependable and Secure Computing",
issn = "1545-5971",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "5",

}

TY - JOUR

T1 - Orchestrating an ensemble of mapreduce jobs for minimizing their makespan

AU - Verma, Abhishek

AU - Cherkasova, Ludmila

AU - Campbell, Roy H.

PY - 2013/9/17

Y1 - 2013/9/17

N2 - Cloud computing offers an attractive option for businesses to rent a suitable size MapReduce cluster, consume resources as a service, and pay only for resources that were consumed. A key challenge in such environments is to increase the utilization of MapReduce clusters to minimize their cost. One way of achieving this goal is to optimize the execution of Mapreduce jobs on the cluster. For a set of production jobs that are executed periodically on new data, we can perform an offline analysis for evaluating performance benefits of different optimization techniques. In this work, we consider a subset of production workloads that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We introduce a simple abstraction where each MapReduce job is represented as a pair of map and reduce stage durations. This representation enables us to apply the classic Johnson algorithm that was designed for building an optimal two-stage job schedule. We evaluate the performance benefits of the constructed schedule through an extensive set of simulations over a variety of realistic workloads. The results are workload and cluster-size dependent, but it is typical to achieve up to 10-25 percent of makespan improvements by simply processing the jobs in the right order. However, in some cases, the simplified abstraction assumed by Johnson's algorithm may lead to a suboptimal job schedule. We design a novel heuristic, called BalancedPools, that significantly improves Johnson's schedule results (up to 15-38 percent), exactly in the situations when it produces suboptimal makespan. Overall, we observe up to 50 percent in the makespan improvements with the new BalancedPools algorithm. The results of our simulation study are validated through experiments on a 66-node Hadoop cluster.

AB - Cloud computing offers an attractive option for businesses to rent a suitable size MapReduce cluster, consume resources as a service, and pay only for resources that were consumed. A key challenge in such environments is to increase the utilization of MapReduce clusters to minimize their cost. One way of achieving this goal is to optimize the execution of Mapreduce jobs on the cluster. For a set of production jobs that are executed periodically on new data, we can perform an offline analysis for evaluating performance benefits of different optimization techniques. In this work, we consider a subset of production workloads that consists of MapReduce jobs with no dependencies. We observe that the order in which these jobs are executed can have a significant impact on their overall completion time and the cluster resource utilization. Our goal is to automate the design of a job schedule that minimizes the completion time (makespan) of such a set of MapReduce jobs. We introduce a simple abstraction where each MapReduce job is represented as a pair of map and reduce stage durations. This representation enables us to apply the classic Johnson algorithm that was designed for building an optimal two-stage job schedule. We evaluate the performance benefits of the constructed schedule through an extensive set of simulations over a variety of realistic workloads. The results are workload and cluster-size dependent, but it is typical to achieve up to 10-25 percent of makespan improvements by simply processing the jobs in the right order. However, in some cases, the simplified abstraction assumed by Johnson's algorithm may lead to a suboptimal job schedule. We design a novel heuristic, called BalancedPools, that significantly improves Johnson's schedule results (up to 15-38 percent), exactly in the situations when it produces suboptimal makespan. Overall, we observe up to 50 percent in the makespan improvements with the new BalancedPools algorithm. The results of our simulation study are validated through experiments on a 66-node Hadoop cluster.

KW - Batch workloads

KW - Hadoop

KW - Map reduce

KW - Minimized makespan

KW - Optimized schedule

UR - http://www.scopus.com/inward/record.url?scp=84883770409&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84883770409&partnerID=8YFLogxK

U2 - 10.1109/TDSC.2013.14

DO - 10.1109/TDSC.2013.14

M3 - Article

AN - SCOPUS:84883770409

VL - 10

SP - 314

EP - 327

JO - IEEE Transactions on Dependable and Secure Computing

JF - IEEE Transactions on Dependable and Secure Computing

SN - 1545-5971

IS - 5

M1 - 6461893

ER -