Reducing Waste in Extreme Scale Systems through Introspective Analysis

Leonardo Bautista-Gomez, Ana Gainaru, Swann Perarnau, Devesh Tiwari, Saurabh Gupta, Christian Engelmann, Franck Cappello, Marc Snir

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers show that periods of higher failure density occur with up to three times more than the average. We design a monitoring system that listens to hardware events and forwards important events to the runtime to detect those regime changes. We implement a runtime capable of receiving notifications and adapt dynamically. In addition, we build an analytical model to predict the gains that such dynamic approach could achieve. We demonstrate that in some systems, our approach can reduce the wasted time by over 30%.

Original languageEnglish (US)
Title of host publicationProceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages212-221
Number of pages10
ISBN (Electronic)9781509021406
DOIs
StatePublished - Jul 18 2016
Event30th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016 - Chicago, United States
Duration: May 23 2016May 27 2016

Publication series

NameProceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016

Other

Other30th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016
Country/TerritoryUnited States
CityChicago
Period5/23/165/27/16

Keywords

  • Fault Tolerance
  • Introspective Systems
  • Resilience
  • Silent Data Corruption
  • Soft Errors
  • Supercomputers

ASJC Scopus subject areas

  • Computer Networks and Communications

Fingerprint

Dive into the research topics of 'Reducing Waste in Extreme Scale Systems through Introspective Analysis'. Together they form a unique fingerprint.

Cite this