Module prototype for online failure prediction for the IBM Blue Gene/L

Lizandro D. Solano-Quinde, Brett M. Bode

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold in excess of 200K processors and has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure significantly reduces the performance of the system. Although reactive fault-tolerant policies effectively minimize the effects of faults, it has been shown that these techniques drastically reduce system performance. Proactive fault-tolerant policies have emerged as an alternative because they impose less performance degradation. These policies are based on the analysis of information about the state of the system. The monitoring system of the IBM Blue Gene/L generates online information about the state of the system's hardware and software and stores that information in the RAS event log. In this study, we design and implement a module prototype for online failure prediction. The prototype is tested and validated in a realistic scenario using the RAS event log of an IBM Blue Gene/L system. We show that our module prototype for failure prediction predicts up to 70% of the fatal events.

Original language: English (US)
Title of host publication: 2008 IEEE International Conference on Electro/Information Technology, IEEE EIT 2008 Conference
Pages: 470-474
Number of pages: 5
State: Published - 2008
Externally published: Yes
Event: 2008 IEEE International Conference on Electro/Information Technology, IEEE EIT 2008 Conference - Ames, IA, United States
Duration: May 18 2008 → May 20 2008

Publication series

Name: 2008 IEEE International Conference on Electro/Information Technology, IEEE EIT 2008 Conference

Other

Other: 2008 IEEE International Conference on Electro/Information Technology, IEEE EIT 2008 Conference
Country/Territory: United States
City: Ames, IA
Period: 5/18/08 → 5/20/08

Keywords

  • Blue Gene/L
  • Computer Fault Tolerance
  • Failure analysis
  • Software fault tolerance

ASJC Scopus subject areas

  • Information Systems and Management
  • Electrical and Electronic Engineering
  • Communication
