TY - JOUR
T1 - Protocol-aware recovery for consensus-based distributed storage
AU - Alagappan, Ramnatthan
AU - Ganesan, Aishwarya
AU - Lee, Eric
AU - Albarghouthi, Aws
AU - Chidambaram, Vijay
AU - Arpaci-Dusseau, Andrea C.
AU - Arpaci-Dusseau, Remzi H.
N1 - Funding Information:
This material was supported by funding from NSF grants CNS-1421033 and CNS-1218405, DOE grant DE-SC0014935, and donations from EMC, Huawei, Microsoft, and VMware. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and may not reflect the views of NSF, DOE, or other institutions. This article is an extended version of a FAST ’18 paper by Alagappan et al. [7]. The additional material here includes a discussion on how Par can be applied to other systems, a proof of why crashes and corruptions cannot always be disentangled, an overview diagram that summarizes the entire recovery protocol, new performance experiments, new figures explaining leader-initiated snapshots, and many other small updates. Authors’ addresses: R. Alagappan and A. Ganesan, 1210 W. Dayton St., Madison, WI 53706; emails: {ra, ag}@cs.wisc.edu; E. Lee, 2317 Speedway, Austin, TX 78712; A. Albarghouthi, 1210 W. Dayton St., Madison, WI 53706; V. Chidambaram, 2317 Speedway, Austin, TX 78712; A. C. Arpaci-Dusseau and R. H. Arpaci-Dusseau, 1210 W. Dayton St., Madison, WI 53706.
Publisher Copyright:
© 2018 Association for Computing Machinery.
PY - 2018/11
Y1 - 2018/11
N2 - We introduce protocol-aware recovery (Par), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of Par through the design and implementation of corruption-tolerant replication (Ctrl), a Par mechanism specific to replicated state machine (RSM) systems. We experimentally show that the Ctrl versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the Ctrl versions achieve this reliability with little performance overhead.
AB - We introduce protocol-aware recovery (Par), a new approach that exploits protocol-specific knowledge to correctly recover from storage faults in distributed systems. We demonstrate the efficacy of Par through the design and implementation of corruption-tolerant replication (Ctrl), a Par mechanism specific to replicated state machine (RSM) systems. We experimentally show that the Ctrl versions of two systems, LogCabin and ZooKeeper, safely recover from storage faults and provide high availability, while the unmodified versions can lose data or become unavailable. We also show that the Ctrl versions achieve this reliability with little performance overhead.
KW - Consensus
KW - Data corruption
KW - Fault tolerance
KW - Storage faults
UR - http://www.scopus.com/inward/record.url?scp=85061192560&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85061192560&partnerID=8YFLogxK
U2 - 10.1145/3241062
DO - 10.1145/3241062
M3 - Article
AN - SCOPUS:85061192560
SN - 1553-3077
VL - 14
JO - ACM Transactions on Storage
JF - ACM Transactions on Storage
IS - 3
M1 - 21
ER -