ECOFIT: A framework to estimate energy consumption of fault tolerance protocols for HPC applications

Mohammed El Mehdi Diouri, Olivier Glück, Laurent Lefèvre, Franck Cappello

Research output: Contribution to conference › Paper › peer-review

Abstract

Energy consumption and fault tolerance are two interrelated issues to address in designing future exascale systems. Fault tolerance protocols used for checkpointing consume different amounts of energy depending on parameters such as application features, the number of processes in the execution, and platform characteristics. Currently, the only way to select a protocol for a given execution is to run the application and monitor the energy consumption of the different fault tolerance protocols. This must be repeated for every variation of the execution setting. To avoid this time- and energy-consuming process, we propose an energy estimation framework. It relies on an energy calibration of the considered platform and a user description of the execution setting. We evaluate the accuracy of our estimations with real applications running on a real platform with energy consumption monitoring. Results show that our estimations are highly accurate and allow selecting the best fault tolerance protocol without pre-executing the application.
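The abstract's idea of combining a one-time platform calibration with a user-supplied description of the execution can be illustrated with a minimal sketch. This is not the ECOFIT model itself, whose details are not given on this page. It assumes, purely for illustration, that calibration yields an energy cost in joules per unit of each operation and that the user description yields a unit count per operation for each protocol. All names (`estimate_energy_joules`, `select_protocol`, the operation keys) and all numeric values are hypothetical.

```python
def estimate_energy_joules(calibration, workload):
    """Sum (calibrated joules per unit) * (unit count) over the operations in a workload."""
    return sum(calibration[op] * units for op, units in workload.items())


def select_protocol(calibration, workloads):
    """Return the protocol with the lowest estimated energy, plus all estimates."""
    estimates = {
        name: estimate_energy_joules(calibration, workload)
        for name, workload in workloads.items()
    }
    best = min(estimates, key=estimates.get)
    return best, estimates


# Hypothetical calibration: joules per unit of each operation (made-up values).
calibration = {
    "checkpoint_byte": 2e-8,     # writing one byte of checkpoint data
    "logged_byte": 5e-9,         # logging one byte of message payload
    "coordination_round": 0.5,   # one global synchronization before a checkpoint
}

# Hypothetical execution description: unit counts per protocol (made-up values).
workloads = {
    "coordinated": {"checkpoint_byte": 1e10, "coordination_round": 10},
    "message_logging": {"checkpoint_byte": 1e10, "logged_byte": 4e9},
}

best, estimates = select_protocol(calibration, workloads)
print(best, estimates)
```

Under these assumptions, changing the execution setting only requires editing the unit counts and recomputing the sum, with no new run of the application.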

Original language: English (US)
Pages: 522-529
Number of pages: 8
State: Published - 2013
Event: 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2013 - Delft, Netherlands
Duration: May 13 2013 to May 16 2013

Other

Other: 13th IEEE/ACM International Symposium on Cluster, Cloud, and Grid Computing, CCGrid 2013
Country/Territory: Netherlands
City: Delft
Period: 5/13/13 to 5/16/13

Keywords

  • Checkpoint/restart
  • Energy consumption
  • Estimation
  • Fault tolerance protocols
  • Performance

ASJC Scopus subject areas

  • Software
