Addressing the last roadblock for message logging in HPC: Alleviating the memory requirement using dedicated resources

Tatiana Martsinkevich, Thomas Ropars, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

The global application checkpoint-restart approach in current use will not be suitable for HPC applications running at large scale: given the predicted fault rates, it will impose a high load on the I/O subsystem and lead to inefficient resource usage. Combining application checkpointing with message logging is appealing because it allows restarting only the processes that actually failed. One major issue with message logging protocols is the large amount of memory required to store logs. In this work, we propose to use additional dedicated resources to save the part of the logs that would not fit in the memory of a compute node. We show that, combined with a cluster-based hierarchical logging technique, only a few dedicated nodes would be required to accommodate the memory requirement of message logging protocols. We additionally show that the proposed technique achieves a reasonable performance overhead.

Original language: English (US)
Title of host publication: Euro-Par 2015
Subtitle of host publication: Parallel Processing Workshops - Euro-Par 2015 International Workshops, Revised Selected Papers
Editors: Sascha Hunold, Josef Weidendorfer, Domingo Gimenez, Laura Ricci, Stefan Lankes, Alexandru Costan, Ana Lucia Varbanescu, Stephen L. Scott, María Engracia Gómez Requena, Vittorio Scarano, Alexandru Iosup, Michael Alexander
Publisher: Springer
Pages: 644-655
Number of pages: 12
ISBN (Print): 9783319273075
DOIs
State: Published - 2015
Externally published: Yes
Event: International Workshops on Parallel Processing Workshops, Euro-Par 2015 - Vienna, Austria
Duration: Aug 24 2015 to Aug 25 2015

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 9523
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: International Workshops on Parallel Processing Workshops, Euro-Par 2015
Country/Territory: Austria
City: Vienna
Period: 8/24/15 to 8/25/15

Keywords

  • Dedicated resources
  • Fault tolerance
  • Hierarchical message-logging protocols
  • High-performance computing
  • Message logging

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
