TY - GEN
T1 - Zorro
T2 - 6th ACM Symposium on Cloud Computing, ACM SoCC 2015
AU - Pundir, Mayank
AU - Leslie, Luke M.
AU - Gupta, Indranil
AU - Campbell, Roy H.
N1 - Publisher Copyright:
©2015 ACM.
PY - 2015/8/27
Y1 - 2015/8/27
N2 - Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems - PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87- 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.
AB - Distributed graph processing systems largely rely on proactive techniques for failure recovery. Unfortunately, these approaches (such as checkpointing) entail a significant overhead. In this paper, we argue that distributed graph processing systems should instead use a reactive approach to failure recovery. The reactive approach trades off completeness of the result (generating a slightly inaccurate result) while reducing the overhead during failure-free execution to zero. We build a system called Zorro that imbues this reactive approach, and integrate Zorro into two graph processing systems - PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication inherent in today's graph processing systems to quickly rebuild the state of failed servers. Experiments using real-world graphs demonstrate that Zorro is able to recover over 99% of the graph state when 6-12% of the servers fail, and between 87- 95% when half the cluster fails. Furthermore, using various graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios, and achieves a worst-case accuracy of 97%.
UR - http://www.scopus.com/inward/record.url?scp=84959017407&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84959017407&partnerID=8YFLogxK
U2 - 10.1145/2806777.2806934
DO - 10.1145/2806777.2806934
M3 - Conference contribution
AN - SCOPUS:84959017407
T3 - ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing
SP - 195
EP - 208
BT - ACM SoCC 2015 - Proceedings of the 6th ACM Symposium on Cloud Computing
A2 - Balazinska, Magdalena
A2 - Freedman, Michael J.
A2 - Barahmand, Sumita
A2 - Ghandeharizadeh, Shahram
PB - Association for Computing Machinery
Y2 - 27 August 2015 through 29 August 2015
ER -