Improving Log-based Field Failure Data Analysis of multi-node computing systems

Antonio Pecchia, Domenico Cotroneo, Zbigniew Kalbarczyk, Ravishankar K. Iyer

Research output: Chapter in Book/Report/Conference proceeding, Conference contribution

Abstract

Log-based Field Failure Data Analysis (FFDA) is a widely adopted methodology for assessing the dependability properties of an operational system. A key step in FFDA is filtering the log to remove both entries that are not useful and redundant error entries. Removing redundant entries is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtain realistic measurements. This paper deals with the issues of the tuple heuristic, used to group the error entries in the log, in multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly; thus, an improved heuristic that adopts statistical indicators is proposed. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with both heuristics. The analysis encompasses the log of the Mercury cluster at the National Center for Supercomputing Applications.

Original language: English (US)
Title of host publication: 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks, DSN 2011
Pages: 97-108
Number of pages: 12
DOI: https://doi.org/10.1109/DSN.2011.5958210
State: Published - Aug 26 2011
Event: 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks, DSN 2011 - Hong Kong, Hong Kong
Duration: Jun 27 2011 to Jun 30 2011

Publication series

Name: Proceedings of the International Conference on Dependable Systems and Networks

Other

Other: 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks, DSN 2011
Country: Hong Kong
City: Hong Kong
Period: 6/27/11 to 6/30/11

Keywords

  • Field Failure Data Analysis
  • collision
  • dependability measurements
  • supercomputer
  • tuple heuristic

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Pecchia, A., Cotroneo, D., Kalbarczyk, Z., & Iyer, R. K. (2011). Improving Log-based Field Failure Data Analysis of multi-node computing systems. In 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks, DSN 2011 (pp. 97-108). [5958210] (Proceedings of the International Conference on Dependable Systems and Networks). https://doi.org/10.1109/DSN.2011.5958210