Energy considerations in checkpointing and fault tolerance protocols

M. El Mehdi Diouri, Olivier Glück, Laurent Lefevre, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Exascale supercomputers will gather hundreds of millions of cores. The first problem that we address is resiliency and fault tolerance, so that applications can run to completion on such platforms. The second problem is energy consumption, since such systems will consume an enormous amount of energy. In this paper, we evaluate checkpointing and existing fault tolerance protocols from an energy point of view. We measure on a real testbed the power consumption of the main atomic operations found in these protocols. The first results show that process coordination and RAM logging consume more power than checkpointing and HDD logging. However, the results presented in joules per byte for I/O operations emphasize that checkpointing and HDD logging consume more energy than RAM logging. Finally, we propose to consider energy consumption as a criterion for the choice of fault tolerance protocols. In terms of energy consumption, we should promote message logging for applications exchanging small volumes of data and coordination for applications involving few processes.

Original language: English (US)
Title of host publication: 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops, DSN-W 2012
State: Published - 2012
Event: 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops, DSN-W 2012 - Boston, MA, United States
Duration: Jun 25 2012 to Jun 28 2012

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks

Other

Other: 2012 IEEE/IFIP 42nd International Conference on Dependable Systems and Networks Workshops, DSN-W 2012
Country/Territory: United States
City: Boston, MA
Period: 6/25/12 to 6/28/12

Keywords

  • Checkpointing
  • Energy consumption
  • Evaluation
  • Fault tolerance protocols

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
