DARE: Adaptive data replication for efficient cluster scheduling

Cristina L. Abad, Yi Lu, Roy H. Campbell

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Placing data as close as possible to computation is a common practice of data-intensive systems, commonly referred to as the data locality problem. By analyzing existing production systems, we confirm the benefit of data locality and find that data have different popularity and varying correlation of accesses. We propose DARE, a distributed adaptive data replication algorithm that aids the scheduler to achieve better data locality. DARE solves two problems (how many replicas to allocate for each file and where to place them) using probabilistic sampling and a competitive aging algorithm independently at each node. It takes advantage of existing remote data accesses in the system and incurs no extra network usage. Using two mixed workload traces from Facebook, we show that DARE improves data locality by more than 7 times with the FIFO scheduler in Hadoop and achieves more than 85% data locality for the FAIR scheduler with delay scheduling. Turnaround time and job slowdown are reduced by 19% and 25%, respectively.

Original language: English (US)
Title of host publication: Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011
Pages: 159-168
Number of pages: 10
DOI: https://doi.org/10.1109/CLUSTER.2011.26
State: Published - Nov 16 2011
Event: 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011 - Austin, TX, United States
Duration: Sep 26 2011 → Sep 30 2011

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
ISSN (Print): 1552-5244

Other

Other: 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011
Country: United States
City: Austin, TX
Period: 9/26/11 → 9/30/11

Fingerprint

Scheduling
Turnaround time
Aging of materials
Sampling

Keywords

  • MapReduce
  • locality
  • replication
  • scheduling

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing

Cite this

Abad, C. L., Lu, Y., & Campbell, R. H. (2011). DARE: Adaptive data replication for efficient cluster scheduling. In Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011 (pp. 159-168). [6061051] (Proceedings - IEEE International Conference on Cluster Computing, ICCC). https://doi.org/10.1109/CLUSTER.2011.26

DARE: Adaptive data replication for efficient cluster scheduling. / Abad, Cristina L.; Lu, Yi; Campbell, Roy H.

Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011. 2011. p. 159-168 6061051 (Proceedings - IEEE International Conference on Cluster Computing, ICCC).

Abad, CL, Lu, Y & Campbell, RH 2011, DARE: Adaptive data replication for efficient cluster scheduling. in Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011., 6061051, Proceedings - IEEE International Conference on Cluster Computing, ICCC, pp. 159-168, 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011, Austin, TX, United States, 9/26/11. https://doi.org/10.1109/CLUSTER.2011.26
Abad CL, Lu Y, Campbell RH. DARE: Adaptive data replication for efficient cluster scheduling. In Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011. 2011. p. 159-168. 6061051. (Proceedings - IEEE International Conference on Cluster Computing, ICCC). https://doi.org/10.1109/CLUSTER.2011.26
Abad, Cristina L.; Lu, Yi; Campbell, Roy H. / DARE: Adaptive data replication for efficient cluster scheduling. Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011. 2011. pp. 159-168 (Proceedings - IEEE International Conference on Cluster Computing, ICCC).
@inproceedings{79a1d729c7fd4a69bfad8b8ba36bdf5c,
title = "DARE: Adaptive data replication for efficient cluster scheduling",
abstract = "Placing data as close as possible to computation is a common practice of data-intensive systems, commonly referred to as the data locality problem. By analyzing existing production systems, we confirm the benefit of data locality and find that data have different popularity and varying correlation of accesses. We propose DARE, a distributed adaptive data replication algorithm that aids the scheduler to achieve better data locality. DARE solves two problems (how many replicas to allocate for each file and where to place them) using probabilistic sampling and a competitive aging algorithm independently at each node. It takes advantage of existing remote data accesses in the system and incurs no extra network usage. Using two mixed workload traces from Facebook, we show that DARE improves data locality by more than 7 times with the FIFO scheduler in Hadoop and achieves more than 85{\%} data locality for the FAIR scheduler with delay scheduling. Turnaround time and job slowdown are reduced by 19{\%} and 25{\%}, respectively.",
keywords = "MapReduce, locality, replication, scheduling",
author = "Abad, {Cristina L.} and Lu, Yi and Campbell, {Roy H.}",
year = "2011",
month = "11",
day = "16",
doi = "10.1109/CLUSTER.2011.26",
language = "English (US)",
isbn = "9780769545165",
series = "Proceedings - IEEE International Conference on Cluster Computing, ICCC",
pages = "159--168",
booktitle = "Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011",

}

TY - GEN

T1 - DARE: Adaptive data replication for efficient cluster scheduling

AU - Abad, Cristina L.

AU - Lu, Yi

AU - Campbell, Roy H.

PY - 2011/11/16

Y1 - 2011/11/16

AB - Placing data as close as possible to computation is a common practice of data-intensive systems, commonly referred to as the data locality problem. By analyzing existing production systems, we confirm the benefit of data locality and find that data have different popularity and varying correlation of accesses. We propose DARE, a distributed adaptive data replication algorithm that aids the scheduler to achieve better data locality. DARE solves two problems (how many replicas to allocate for each file and where to place them) using probabilistic sampling and a competitive aging algorithm independently at each node. It takes advantage of existing remote data accesses in the system and incurs no extra network usage. Using two mixed workload traces from Facebook, we show that DARE improves data locality by more than 7 times with the FIFO scheduler in Hadoop and achieves more than 85% data locality for the FAIR scheduler with delay scheduling. Turnaround time and job slowdown are reduced by 19% and 25%, respectively.

KW - MapReduce

KW - locality

KW - replication

KW - scheduling

UR - http://www.scopus.com/inward/record.url?scp=80955123462&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=80955123462&partnerID=8YFLogxK

U2 - 10.1109/CLUSTER.2011.26

DO - 10.1109/CLUSTER.2011.26

M3 - Conference contribution

AN - SCOPUS:80955123462

SN - 9780769545165

T3 - Proceedings - IEEE International Conference on Cluster Computing, ICCC

SP - 159

EP - 168

BT - Proceedings - 2011 IEEE International Conference on Cluster Computing, CLUSTER 2011

ER -