TY - GEN
T1 - Rebound
T2 - 38th Annual International Symposium on Computer Architecture, ISCA'11
AU - Agarwal, Rishi
AU - Garg, Pranav
AU - Torrellas, Josep
PY - 2011
Y1 - 2011
AB - As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
KW - Faults
KW - Scalable checkpointing
KW - Shared-memory multiprocessors
UR - http://www.scopus.com/inward/record.url?scp=80052552708&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052552708&partnerID=8YFLogxK
U2 - 10.1145/2000064.2000083
DO - 10.1145/2000064.2000083
M3 - Conference contribution
AN - SCOPUS:80052552708
SN - 9781450304726
T3 - Proceedings - International Symposium on Computer Architecture
SP - 153
EP - 164
BT - Proceedings of the 38th Annual International Symposium on Computer Architecture, ISCA'11
Y2 - 4 June 2011 through 8 June 2011
ER -