TY - GEN
T1 - SPBC
T2 - 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
AU - Ropars, Thomas
AU - Martsinkevich, Tatiana V.
AU - Guermouche, Amina
AU - Schiper, André
AU - Cappello, Franck
PY - 2013
Y1 - 2013
N2 - The high failure rate expected for future supercomputers requires the design of new fault tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: We identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol called SPBC combines in a hierarchical way coordinated checkpointing and message logging. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and this, without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate a good performance of our protocol during both, failure-free execution and recovery.
AB - The high failure rate expected for future supercomputers requires the design of new fault tolerant solutions. Most checkpointing protocols are designed to work with any message-passing application but suffer from scalability issues at extreme scale. We take a different approach: We identify a property common to many HPC applications, namely channel-determinism, and introduce a new partial order relation, called always-happens-before relation, between events of such applications. Leveraging these two concepts, we design a protocol that combines an unprecedented set of features. Our protocol called SPBC combines in a hierarchical way coordinated checkpointing and message logging. It is the first protocol that provides failure containment without logging any information reliably apart from process checkpoints, and this, without penalizing recovery performance. Experiments run with a representative set of HPC workloads demonstrate a good performance of our protocol during both, failure-free execution and recovery.
UR - http://www.scopus.com/inward/record.url?scp=84899691000&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84899691000&partnerID=8YFLogxK
U2 - 10.1145/2503210.2503271
DO - 10.1145/2503210.2503271
M3 - Conference contribution
AN - SCOPUS:84899691000
SN - 9781450323789
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2013
PB - IEEE Computer Society
Y2 - 17 November 2013 through 22 November 2013
ER -