TY - GEN
T1 - Addressing the last roadblock for message logging in HPC
T2 - International Workshops on Parallel Processing Workshops, Euro-Par 2015
AU - Martsinkevich, Tatiana
AU - Ropars, Thomas
AU - Cappello, Franck
N1 - Funding Information:
Experiments presented in this paper were carried out using the Grid'5000 experimental testbed, being developed under the INRIA ALADDIN development action with support from CNRS, RENATER and several Universities as well as other funding bodies (see https://www.grid5000.fr). This material is based upon work supported by the U.S. Department of Energy, Office of Science, under contract DE-AC02-06CH11357. The submitted manuscript has been created by UChicago Argonne, LLC, Operator of Argonne National Laboratory ("Argonne"). Argonne, a U.S. Department of Energy Office of Science laboratory, is operated under Contract No. DE-AC02-06CH11357. The U.S. Government retains for itself, and others acting on its behalf, a paid-up nonexclusive, irrevocable worldwide license in said article to reproduce, prepare derivative works, distribute copies to the public, and perform publicly and display publicly, by or on behalf of the Government.
Publisher Copyright:
© Springer International Publishing Switzerland 2015.
PY - 2015
Y1 - 2015
N2 - Currently used global application checkpoint-restart will not be a suitable solution for HPC applications running at large scale because, given the predicted fault rates, it will impose a high load on the I/O subsystem and lead to inefficient resource usage. Combining application checkpointing with message logging is appealing as it allows restarting only the processes that actually failed. One major issue with message logging protocols is the large amount of memory required to store logs. In this work we propose to use additional dedicated resources to save the part of the logs that would not fit in the memory of a compute node. We show that, combined with a cluster-based hierarchical logging technique, only a few dedicated nodes would be required to accommodate the memory requirement of message logging protocols. We additionally show that the proposed technique achieves a reasonable performance overhead.
AB - Currently used global application checkpoint-restart will not be a suitable solution for HPC applications running at large scale because, given the predicted fault rates, it will impose a high load on the I/O subsystem and lead to inefficient resource usage. Combining application checkpointing with message logging is appealing as it allows restarting only the processes that actually failed. One major issue with message logging protocols is the large amount of memory required to store logs. In this work we propose to use additional dedicated resources to save the part of the logs that would not fit in the memory of a compute node. We show that, combined with a cluster-based hierarchical logging technique, only a few dedicated nodes would be required to accommodate the memory requirement of message logging protocols. We additionally show that the proposed technique achieves a reasonable performance overhead.
KW - Dedicated resources
KW - Fault tolerance
KW - Hierarchical message-logging protocols
KW - High-performance computing
KW - Message logging
UR - http://www.scopus.com/inward/record.url?scp=84951963929&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84951963929&partnerID=8YFLogxK
U2 - 10.1007/978-3-319-27308-2_52
DO - 10.1007/978-3-319-27308-2_52
M3 - Conference contribution
AN - SCOPUS:84951963929
SN - 9783319273075
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 644
EP - 655
BT - Euro-Par 2015
A2 - Hunold, Sascha
A2 - Weidendorfer, Josef
A2 - Gimenez, Domingo
A2 - Ricci, Laura
A2 - Lankes, Stefan
A2 - Costan, Alexandru
A2 - Varbanescu, Ana Lucia
A2 - Scott, Stephen L.
A2 - Requena, María Engracia Gómez
A2 - Scarano, Vittorio
A2 - Iosup, Alexandru
A2 - Alexander, Michael
PB - Springer
Y2 - 24 August 2015 through 25 August 2015
ER -