TY - GEN
T1 - Distributed diskless checkpoint for large scale systems
AU - Gomez, Leonardo Bautista
AU - Maruyama, Naoya
AU - Cappello, Franck
AU - Matsuoka, Satoshi
PY - 2010
Y1 - 2010
N2 - In high performance computing (HPC), applications are periodically checkpointed to stable storage to increase the success rate of long executions. Nowadays, the overhead imposed by disk-based checkpointing is about 20% of execution time, and in the coming years it will exceed 50% if the checkpoint frequency increases as the fault frequency increases. Diskless checkpointing has been introduced as a solution to avoid the I/O bottleneck of disk-based checkpointing. However, the encoding time, the dedicated resources (the spares), and the memory overhead imposed by diskless checkpointing are significant obstacles to its adoption. In this work, we address these three limitations: 1) we propose a fault-tolerant model able to tolerate up to 50% of process failures with a low checkpointing overhead; 2) our fault tolerance model works without spare nodes, while still guaranteeing high reliability; 3) we use solid state drives to significantly increase the checkpoint performance and avoid the memory overhead of classic diskless checkpointing.
AB - In high performance computing (HPC), applications are periodically checkpointed to stable storage to increase the success rate of long executions. Nowadays, the overhead imposed by disk-based checkpointing is about 20% of execution time, and in the coming years it will exceed 50% if the checkpoint frequency increases as the fault frequency increases. Diskless checkpointing has been introduced as a solution to avoid the I/O bottleneck of disk-based checkpointing. However, the encoding time, the dedicated resources (the spares), and the memory overhead imposed by diskless checkpointing are significant obstacles to its adoption. In this work, we address these three limitations: 1) we propose a fault-tolerant model able to tolerate up to 50% of process failures with a low checkpointing overhead; 2) our fault tolerance model works without spare nodes, while still guaranteeing high reliability; 3) we use solid state drives to significantly increase the checkpoint performance and avoid the memory overhead of classic diskless checkpointing.
UR - http://www.scopus.com/inward/record.url?scp=77954904463&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77954904463&partnerID=8YFLogxK
U2 - 10.1109/CCGRID.2010.40
DO - 10.1109/CCGRID.2010.40
M3 - Conference contribution
AN - SCOPUS:77954904463
SN - 9781424469871
T3 - CCGrid 2010 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing
SP - 63
EP - 72
BT - CCGrid 2010 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing
T2 - 10th IEEE/ACM International Conference on Cluster, Cloud, and Grid Computing, CCGrid 2010
Y2 - 17 May 2010 through 20 May 2010
ER -