Assessing energy efficiency of fault tolerance protocols for HPC systems

Esteban Meneses, Osman Sarood, Laxmikant V Kale

Research output: Contribution to journal › Conference article

Abstract

An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.

Original language: English (US)
Article number: 6374769
Pages (from-to): 35-42
Number of pages: 8
Journal: Proceedings - Symposium on Computer Architecture and High Performance Computing
DOI: 10.1109/SBAC-PAD.2012.12
State: Published - Dec 1 2012
Event: 24th International Symposium on Computer Architecture and High Performance Computing, SBAC-PAD 2012 - New York, NY, United States
Duration: Oct 24 2012 to Oct 26 2012

Fingerprint

Fault tolerance
Energy efficiency
Recovery
Energy utilization
Analytical models

Keywords

  • energy efficiency
  • fault tolerance

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software

Cite this

Assessing energy efficiency of fault tolerance protocols for HPC systems. / Meneses, Esteban; Sarood, Osman; Kale, Laxmikant V.

In: Proceedings - Symposium on Computer Architecture and High Performance Computing, 01.12.2012, p. 35-42.

@article{0718bbd67deb429791119f705c046eac,
title = "Assessing energy efficiency of fault tolerance protocols for HPC systems",
abstract = "An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17{\%} and reduce the energy consumption by 13{\%} when compared to checkpoint/restart.",
keywords = "energy efficiency, fault tolerance",
author = "Meneses, Esteban and Sarood, Osman and Kale, Laxmikant V",
year = "2012",
month = "12",
day = "1",
doi = "10.1109/SBAC-PAD.2012.12",
language = "English (US)",
pages = "35--42",
journal = "Proceedings - Symposium on Computer Architecture and High Performance Computing",
issn = "1550-6533",

}

TY - JOUR

T1 - Assessing energy efficiency of fault tolerance protocols for HPC systems

AU - Meneses, Esteban

AU - Sarood, Osman

AU - Kale, Laxmikant V

PY - 2012/12/1

Y1 - 2012/12/1

AB - An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.

KW - energy efficiency

KW - fault tolerance

UR - http://www.scopus.com/inward/record.url?scp=84871643381&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84871643381&partnerID=8YFLogxK

U2 - 10.1109/SBAC-PAD.2012.12

DO - 10.1109/SBAC-PAD.2012.12

M3 - Conference article

AN - SCOPUS:84871643381

SP - 35

EP - 42

JO - Proceedings - Symposium on Computer Architecture and High Performance Computing

JF - Proceedings - Symposium on Computer Architecture and High Performance Computing

SN - 1550-6533

M1 - 6374769

ER -