A 'cool' way of improving the reliability of HPC machines

Osman Sarood, Esteban Meneses, Laxmikant V Kale

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.
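
The abstract's rule of thumb, that fault rates double for every 10°C rise in core temperature, can be turned into a small calculation. The sketch below is illustrative only: the function names are hypothetical, and it assumes that reliability (mean time between failures) is simply the inverse of the fault rate and that the doubling rule holds uniformly over the temperature range of interest. It is not the validated model the paper refers to.

```python
import math

# From the abstract: fault rate doubles for every 10 degC rise in core temperature.
DOUBLING_INTERVAL_C = 10.0


def relative_fault_rate(delta_t_c: float) -> float:
    """Fault rate relative to baseline after the core temperature
    changes by delta_t_c degrees Celsius (negative means cooler)."""
    return 2.0 ** (delta_t_c / DOUBLING_INTERVAL_C)


def reliability_gain(cooling_c: float) -> float:
    """MTBF multiplier obtained by lowering core temperature by cooling_c.

    Assumption (not from the source): MTBF is the inverse of the fault rate.
    """
    return 1.0 / relative_fault_rate(-cooling_c)


def cooling_for_gain(gain: float) -> float:
    """Temperature reduction (degC) that the doubling rule alone would
    require to reach a given MTBF multiplier."""
    return DOUBLING_INTERVAL_C * math.log2(gain)


if __name__ == "__main__":
    print(f"Gain from 10 degC of cooling: {reliability_gain(10.0):.2f}x")
    print(f"Cooling needed for a 2.3x gain: {cooling_for_gain(2.3):.1f} degC")
```

Under the doubling rule alone, a reliability factor of 2.3 corresponds to roughly 12°C of cooling. The abstract does not state the temperature reduction actually applied in the experiments, so this figure only illustrates the arithmetic.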

Original language: English (US)
Title of host publication: Proceedings of SC 2013
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Print): 9781450323789
DOI: https://doi.org/10.1145/2503210.2503228
State: Published - Jan 1 2013
Event: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013 - Denver, CO, United States
Duration: Nov 17 2013 to Nov 22 2013

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Other

Other: 2013 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2013
Country: United States
City: Denver, CO
Period: 11/17/13 to 11/22/13

Fingerprint

Energy utilization
Hardware
Machine components
Supercomputers
Fault tolerance
Resource allocation
Temperature

Keywords

  • Actionable modeling
  • Checkpointing restart
  • Energy minimization
  • Fault tolerance
  • Load balancing
  • Temperature capping
  • Temperature thresholds
  • Thermal control

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Cite this

Sarood, O., Meneses, E., & Kale, L. V. (2013). A 'cool' way of improving the reliability of HPC machines. In Proceedings of SC 2013: The International Conference for High Performance Computing, Networking, Storage and Analysis [58] (International Conference for High Performance Computing, Networking, Storage and Analysis, SC). IEEE Computer Society. https://doi.org/10.1145/2503210.2503228

@inproceedings{f47cc7396ef44f76be47f24247bd0936,
title = "A 'cool' way of improving the reliability of HPC machines",
abstract = "Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12{\%}. In addition, our scheme can also reduce machine energy consumption by as much as 25{\%}. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1{\%} efficiency), whereas our validated model predicts an efficiency of 20{\%} by improving the machine reliability by a factor of up to 2.29.",
keywords = "Actionable modeling, Checkpointing restart, Energy minimization, Fault tolerance, Load balancing, Temperature capping, Temperature thresholds, Thermal control",
author = "Osman Sarood and Esteban Meneses and Kale, {Laxmikant V}",
year = "2013",
month = "1",
day = "1",
doi = "10.1145/2503210.2503228",
language = "English (US)",
isbn = "9781450323789",
series = "International Conference for High Performance Computing, Networking, Storage and Analysis, SC",
publisher = "IEEE Computer Society",
booktitle = "Proceedings of SC 2013",

}

TY - GEN

T1 - A 'cool' way of improving the reliability of HPC machines

AU - Sarood, Osman

AU - Meneses, Esteban

AU - Kale, Laxmikant V

PY - 2013/1/1

Y1 - 2013/1/1

N2 - Soaring energy consumption, accompanied by declining reliability, together loom as the biggest hurdles for the next generation of supercomputers. Recent reports have expressed concern that reliability at exascale level could degrade to the point where failures become a norm rather than an exception. HPC researchers are focusing on improving existing fault tolerance protocols to address these concerns. Research on improving hardware reliability, i.e., machine component reliability, has also been making progress independently. In this paper, we try to bridge this gap and explore the potential of combining both software and hardware aspects towards improving reliability of HPC machines. Fault rates are known to double for every 10°C rise in core temperature. We leverage this notion to experimentally demonstrate the potential of restraining core temperatures and load balancing to achieve two-fold benefits: improving reliability of parallel machines and reducing total execution time required by applications. Our experimental results show that we can improve the reliability of a machine by a factor of 2.3 and reduce the execution time by 12%. In addition, our scheme can also reduce machine energy consumption by as much as 25%. For a 350K socket machine, regular checkpoint/restart fails to make progress (less than 1% efficiency), whereas our validated model predicts an efficiency of 20% by improving the machine reliability by a factor of up to 2.29.

KW - Actionable modeling

KW - Checkpointing restart

KW - Energy minimization

KW - Fault tolerance

KW - Load balancing

KW - Temperature capping

KW - Temperature thresholds

KW - Thermal control

UR - http://www.scopus.com/inward/record.url?scp=84899668006&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84899668006&partnerID=8YFLogxK

U2 - 10.1145/2503210.2503228

DO - 10.1145/2503210.2503228

M3 - Conference contribution

AN - SCOPUS:84899668006

SN - 9781450323789

T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC

BT - Proceedings of SC 2013

PB - IEEE Computer Society

ER -