TY - GEN
T1 - Low-overhead diskless checkpoint for hybrid computing systems
AU - Bautista Gomez, Leonardo
AU - Nukada, Akira
AU - Maruyama, Naoya
AU - Cappello, Franck
AU - Matsuoka, Satoshi
PY - 2010
Y1 - 2010
N2 - As the size of new supercomputers scales to tens of thousands of sockets, the mean time between failures (MTBF) is decreasing to just several hours and long executions need some kind of fault tolerance method to survive failures. Checkpoint/Restart is a popular technique used for this purpose, but writing the state of a big scientific application to remote storage will become prohibitively expensive in the near future. Diskless checkpoint was proposed as a solution to avoid the I/O bottleneck of disk-based checkpoint. However, the complex, time-consuming encoding techniques hinder its scalability. At the same time, heterogeneous computing is becoming more and more popular in high performance computing (HPC), with new clusters combining CPUs and graphics processing units (GPUs). However, hybrid applications cannot always use all the resources available on the nodes, leaving some idle resources such as GPUs or CPU cores. In this work, we propose a hybrid diskless checkpoint (HDC) technique for GPU-accelerated clusters that can checkpoint CPU/GPU applications, does not require spare nodes and can tolerate up to 50% of process failures with a low, sometimes negligible, checkpoint overhead.
AB - As the size of new supercomputers scales to tens of thousands of sockets, the mean time between failures (MTBF) is decreasing to just several hours and long executions need some kind of fault tolerance method to survive failures. Checkpoint/Restart is a popular technique used for this purpose, but writing the state of a big scientific application to remote storage will become prohibitively expensive in the near future. Diskless checkpoint was proposed as a solution to avoid the I/O bottleneck of disk-based checkpoint. However, the complex, time-consuming encoding techniques hinder its scalability. At the same time, heterogeneous computing is becoming more and more popular in high performance computing (HPC), with new clusters combining CPUs and graphics processing units (GPUs). However, hybrid applications cannot always use all the resources available on the nodes, leaving some idle resources such as GPUs or CPU cores. In this work, we propose a hybrid diskless checkpoint (HDC) technique for GPU-accelerated clusters that can checkpoint CPU/GPU applications, does not require spare nodes and can tolerate up to 50% of process failures with a low, sometimes negligible, checkpoint overhead.
UR - http://www.scopus.com/inward/record.url?scp=79952794881&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79952794881&partnerID=8YFLogxK
U2 - 10.1109/HIPC.2010.5713163
DO - 10.1109/HIPC.2010.5713163
M3 - Conference contribution
AN - SCOPUS:79952794881
SN - 9781424485185
T3 - 17th International Conference on High Performance Computing, HiPC 2010
BT - 17th International Conference on High Performance Computing, HiPC 2010
PB - IEEE Computer Society
T2 - 17th International Conference on High Performance Computing, HiPC 2010
Y2 - 19 December 2010 through 22 December 2010
ER -