A methodology for automatically detecting symptoms of frequently occurring errors in large computer systems is developed. The proposed symptom recognition methodology and its validation are based on probabilistic techniques. The technique is shown to work on real failure data from two CYBER systems at the University of Illinois. The methodology distinguishes between independent and dependent causes and also quantifies the strength of the relationship among the errors. A comparison with failure/repair information obtained from field maintenance engineers shows that, in 85% of the cases, the error symptoms recognized by this approach correspond to real system problems. The remaining 15%, although not directly supported by field data, were confirmed as valid problems. Some of these were shown to be persistent problems that would otherwise have been considered minor transients and hence ignored.
|Original language||English (US)|
|Title of host publication||Unknown Host Publication Title|
|Editors||Harold S. Stone|
|Number of pages||10|
|State||Published - Dec 1 1986|