TY - GEN
T1 - A scalable double in-memory checkpoint and restart scheme towards exascale
AU - Zheng, Gengbin
AU - Xiang, Ni
AU - Kale, Laxmikant V
PY - 2012/12/1
Y1 - 2012/12/1
N2 - As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are effective approaches for dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. In previous work, we have demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this paper, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on the MPI communication layer, and demonstrate the performance on very large scale supercomputers. For example, when running a one-million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in milliseconds. The restart time was measured to be less than 0.15 seconds on 64K cores.
AB - As the size of supercomputers increases, the probability of system failure grows substantially, posing an increasingly significant challenge for scalability. It is important to provide resilience for long-running applications. Checkpoint-based fault tolerance methods are effective approaches for dealing with faults. With these methods, the state of the entire parallel application is checkpointed to reliable storage. When a failure occurs, the application is restarted from a recent checkpoint. In previous work, we have demonstrated an efficient double in-memory checkpoint and restart fault tolerance scheme, which leverages Charm++'s parallel objects for checkpointing. In this paper, we further optimize the scheme by eliminating several bottlenecks caused by serialized communication. We extend the in-memory checkpointing scheme to work on the MPI communication layer, and demonstrate the performance on very large scale supercomputers. For example, when running a one-million-atom molecular dynamics simulation on up to 64K cores of a BlueGene/P machine, the checkpoint time was in milliseconds. The restart time was measured to be less than 0.15 seconds on 64K cores.
UR - http://www.scopus.com/inward/record.url?scp=84880883327&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880883327&partnerID=8YFLogxK
U2 - 10.1109/DSNW.2012.6264677
DO - 10.1109/DSNW.2012.6264677
M3 - Conference contribution
AN - SCOPUS:84880883327
SN - 9781467322645
T3 - Proceedings of the International Conference on Dependable Systems and Networks
BT - 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops, DSN-W 2012
T2 - 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops, DSN-W 2012
Y2 - 25 June 2012 through 28 June 2012
ER -