Zorro: zero-cost reactive failure recovery in distributed graph processing

Mayank Pundir, Luke M. Leslie, Indranil Gupta, R H Campbell

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems - PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87- 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.

Original languageEnglish (US)
Title of host publicationACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing
EditorsMagdalena Balazinska, Michael J. Freedman, Sumita Barahmand, Shahram Ghandeharizadeh
PublisherAssociation for Computing Machinery, Inc
Pages195-208
Number of pages14
ISBN (Electronic)9781450336512
DOIs
StatePublished - Aug 27 2015
Event6th ACM Symposium on Cloud Computing, ACM SoCC 2015 - Kohala Coast, United States
Duration: Aug 27 2015Aug 29 2015

Publication series

NameACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing

Other

Other6th ACM Symposium on Cloud Computing, ACM SoCC 2015
CountryUnited States
CityKohala Coast
Period8/27/158/29/15

Fingerprint

Recovery
Costs
Zero
Graph in graph theory
Processing
Servers
Server
Checkpointing
Inaccurate
Replication
Completeness
Trade-offs
Integrate
Scenarios
Experiments
Vertex of a graph
Demonstrate
Experiment

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Theoretical Computer Science

Cite this

Pundir, M., Leslie, L. M., Gupta, I., & Campbell, R. H. (2015). Zorro: zero-cost reactive failure recovery in distributed graph processing. In M. Balazinska, M. J. Freedman, S. Barahmand, & S. Ghandeharizadeh (Eds.), ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing (pp. 195-208). (ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing). Association for Computing Machinery, Inc. https://doi.org/10.1145/2806777.2806934

Zorro : zero-cost reactive failure recovery in distributed graph processing. / Pundir, Mayank; Leslie, Luke M.; Gupta, Indranil; Campbell, R H.

ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing. ed. / Magdalena Balazinska; Michael J. Freedman; Sumita Barahmand; Shahram Ghandeharizadeh. Association for Computing Machinery, Inc, 2015. p. 195-208 (ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing).

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Pundir, M, Leslie, LM, Gupta, I & Campbell, RH 2015, Zorro: zero-cost reactive failure recovery in distributed graph processing. in M Balazinska, MJ Freedman, S Barahmand & S Ghandeharizadeh (eds), ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing. ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing, Association for Computing Machinery, Inc, pp. 195-208, 6th ACM Symposium on Cloud Computing, ACM SoCC 2015, Kohala Coast, United States, 8/27/15. https://doi.org/10.1145/2806777.2806934
Pundir M, Leslie LM, Gupta I, Campbell RH. Zorro: zero-cost reactive failure recovery in distributed graph processing. In Balazinska M, Freedman MJ, Barahmand S, Ghandeharizadeh S, editors, ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing. Association for Computing Machinery, Inc. 2015. p. 195-208. (ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing). https://doi.org/10.1145/2806777.2806934
Pundir, Mayank ; Leslie, Luke M. ; Gupta, Indranil ; Campbell, R H. / Zorro : zero-cost reactive failure recovery in distributed graph processing. ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing. editor / Magdalena Balazinska ; Michael J. Freedman ; Sumita Barahmand ; Shahram Ghandeharizadeh. Association for Computing Machinery, Inc, 2015. pp. 195-208 (ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing).
@inproceedings{73888f410c5049dd8eca05c92a86ab63,
title = "Zorro: zero-cost reactive failure recovery in distributed graph processing",
abstract = "Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems - PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99{\%} of the graph state when 6-12{\%} of the servers fail, and between 87- 95{\%} when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97{\%}.",
author = "Mayank Pundir and Leslie, {Luke M.} and Indranil Gupta and Campbell, {R H}",
year = "2015",
month = "8",
day = "27",
doi = "10.1145/2806777.2806934",
language = "English (US)",
series = "ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing",
publisher = "Association for Computing Machinery, Inc",
pages = "195--208",
editor = "Magdalena Balazinska and Freedman, {Michael J.} and Sumita Barahmand and Shahram Ghandeharizadeh",
booktitle = "ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing",

}

TY - GEN

T1 - Zorro

T2 - zero-cost reactive failure recovery in distributed graph processing

AU - Pundir, Mayank

AU - Leslie, Luke M.

AU - Gupta, Indranil

AU - Campbell, R H

PY - 2015/8/27

Y1 - 2015/8/27

N2 - Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems - PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87- 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.

AB - Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems - PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87- 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.

UR - http://www.scopus.com/inward/record.url?scp=84959017407&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84959017407&partnerID=8YFLogxK

U2 - 10.1145/2806777.2806934

DO - 10.1145/2806777.2806934

M3 - Conference contribution

AN - SCOPUS:84959017407

T3 - ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing

SP - 195

EP - 208

BT - ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing

A2 - Balazinska, Magdalena

A2 - Freedman, Michael J.

A2 - Barahmand, Sumita

A2 - Ghandeharizadeh, Shahram

PB - Association for Computing Machinery, Inc

ER -