Abstract

This paper proposes a methodology for recognizing the symptoms of persistent problems in large systems. The approach uses the system error rate to identify the error states among which relationships may exist. Statistical techniques are then used to validate and quantify the strength of the relationship among these error states. As input, the approach takes the raw error logs containing a single entry for each error that was detected as an isolated event. As output, it produces a list of symptoms that characterize persistent errors. Thus, given a failure, we determine whether the failure is an intermittent manifestation of a common fault or whether it is an isolated (transient) incident. The technique is first shown to work on two CYBER systems at the University of Illinois. Comparisons to real failure/repair information obtained from field engineers showed that, in about 85% of the cases, the error symptoms recognized by this approach correspond to real problems. The remaining 15% of the cases, although not directly supported by field data, were confirmed as being valid problems. Two of these were long-term persistent problems which had previously gone undiagnosed. The technique is also illustrated on an IBM 3081 multiprocessor system.
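The two-stage idea in the abstract (use the error rate to pick candidate error states, then test statistically whether they are related) can be illustrated with a minimal sketch. This is not the paper's algorithm: the fixed-width time windows, the `min_errors` threshold, the choice of a Pearson chi-square test on a 2x2 presence table, and all function names are assumptions made for illustration only.

```python
from collections import defaultdict
from itertools import combinations

# Chi-square critical value for 1 degree of freedom at alpha = 0.05.
CHI2_CRIT_DF1_ALPHA05 = 3.841


def bucket(entries, window):
    """Group (timestamp, error_type) entries into fixed-width time windows."""
    windows = defaultdict(list)
    for ts, err in entries:
        windows[int(ts // window)].append(err)
    return windows


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:
        return 0.0  # a constant row or column carries no evidence of association
    return n * (a * d - b * c) ** 2 / denom


def related_error_pairs(entries, window=60, min_errors=2):
    """Return (type_x, type_y, statistic) for error types that co-occur
    more often than chance would suggest (hypothetical helper)."""
    windows = bucket(entries, window)
    if not windows:
        return []
    # Stage 1 (assumed): candidate error types are those seen in windows
    # whose error count is elevated, a crude stand-in for an error-rate filter.
    candidates = sorted(
        {e for errs in windows.values() if len(errs) >= min_errors for e in errs}
    )
    first, last = min(windows), max(windows)
    total = last - first + 1  # count quiet windows with no logged errors too
    present = {t: {w for w, errs in windows.items() if t in errs} for t in candidates}
    # Stage 2 (assumed): test each candidate pair for association across windows.
    results = []
    for x, y in combinations(candidates, 2):
        both = len(present[x] & present[y])
        only_x = len(present[x]) - both
        only_y = len(present[y]) - both
        neither = total - both - only_x - only_y
        stat = chi_square_2x2(both, only_x, only_y, neither)
        # Keep only positive association: co-occurrence above expectation.
        if stat > CHI2_CRIT_DF1_ALPHA05 and both * neither > only_x * only_y:
            results.append((x, y, stat))
    return results
```

Given a log of `(timestamp_seconds, error_type)` tuples, `related_error_pairs(log, window=60)` returns the pairs whose co-occurrence passes the test. With few windows the chi-square approximation is unreliable, so an exact test would be a safer choice on small logs.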

Original language: English (US)
Pages (from-to): 525-537
Number of pages: 13
Journal: IEEE Transactions on Computers
Volume: 39
Issue number: 4
State: Published - Apr 1990

Keywords

  • Automatic recognition
  • diagnosis
  • error logs
  • failure symptoms
  • persistent errors
  • statistical testing

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics

Title: Automatic Recognition of Intermittent Failures: An Experimental Study of Field Data
