Taming of the shrew: Modeling the normal and faulty behaviour of large-scale HPC systems

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general, consistent rule, and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and faulty situations relies on event analysis. Being able to quickly detect deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows becomes more challenging, and with the upcoming 10 Petaflop systems there is a lot of interest in this topic. Current event mining approaches do not take into consideration the specific behaviour of each type of event and, as a consequence, fail to analyze events according to their characteristics. In this paper we propose a novel way of characterizing the normal and faulty behaviour of the system by using signal analysis concepts. Together, the analysis modules form ELSA (Event Log Signal Analyzer), a toolkit whose purpose is to model the normal flow of each state event during an HPC system's lifetime and how that flow is affected when a failure hits the system. We show that these extracted models provide an accurate view of the system output, which improves the effectiveness of proactive fault tolerance algorithms. Specifically, we implemented a filtering algorithm and a short-term fault prediction methodology based on the extracted model and tested them against real failure traces from a large-scale system. We show that by analyzing each event according to its specific behaviour, we get a more realistic overview of the entire system.
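
The abstract does not spell out ELSA's algorithms, so the following is only a minimal sketch of the general idea of treating each event type as its own signal, not the authors' implementation. It assumes events arrive as (timestamp, event type) pairs, bins them into fixed-width windows, and flags windows whose count deviates strongly from that event type's own median. The function names `event_signals` and `find_outliers`, the window binning, and the median absolute deviation threshold are all illustrative assumptions.

```python
from collections import defaultdict
from statistics import median


def event_signals(events, window, n_windows):
    """Bin (timestamp, event_type) pairs into one count-per-window signal per type.

    Hypothetical helper: events outside [0, window * n_windows) are ignored.
    """
    signals = defaultdict(lambda: [0] * n_windows)
    for timestamp, event_type in events:
        index = int(timestamp // window)
        if 0 <= index < n_windows:
            signals[event_type][index] += 1
    return dict(signals)


def find_outliers(signal, threshold=3.0):
    """Return indices of windows that deviate from the signal's own baseline.

    Assumption: "normal" is the median count, and a window is abnormal when it
    differs from the median by more than `threshold` times the median absolute
    deviation (MAD). This is a generic robust test, not ELSA's method.
    """
    center = median(signal)
    mad = median([abs(x - center) for x in signal])
    scale = mad if mad > 0 else 1.0  # fall back to 1.0 for constant signals
    return [i for i, x in enumerate(signal) if abs(x - center) > threshold * scale]


if __name__ == "__main__":
    # Made-up per-window counts for a single event type, with one burst.
    counts = [2, 3, 2, 3, 2, 2, 3, 40, 2, 3]
    print(find_outliers(counts))
```

Because each event type gets its own baseline, a chatty event with a sudden burst and a quiet event are judged against different notions of normal, which is the point the abstract makes about analyzing each event according to its specific behaviour.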

Original language: English (US)
Title of host publication: Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Pages: 1168-1179
Number of pages: 12
State: Published - 2012
Event: 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012 - Shanghai, China
Duration: May 21 2012 to May 25 2012

Keywords

  • fault detection
  • fault tolerance
  • large-scale HPC systems
  • signal analysis

ASJC Scopus subject areas

  • Software
