Measurement-based analysis of multiple latent errors and near-coincident fault discovery in a shared memory multiprocessor

S. G. Mitra, R. K. Iyer

Research output: Contribution to journal › Conference article › peer-review

Abstract

A methodology for studying multiple latent errors and near-coincident fault discovery in the memory of a shared-memory multiprocessor is presented. The delay between the generation of an error due to a fault and its detection (error latency) can cause multiple latent errors and near-coincident fault discovery in a system. The latter effect is widely known to be catastrophic to continued operation, even in highly fault-tolerant systems. The methodology is illustrated on the Alliant FX/8 under real concurrent workload conditions over a five-day period. The authors found that for a conservative error rate of one error per day, one out of four errors may manifest itself as a multiple latent error. At the same error rate, 8% of the error discoveries are near-coincident in nature for a time-window size of 50 μs (approximately 250 instruction cycles). A strong correlation between the existence of multiple latent errors and their near-coincident discovery is quantified.
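
The near-coincident fraction quoted above can be illustrated with a minimal sketch. It assumes a simple definition that the abstract does not spell out: a discovery counts as near-coincident when at least one other discovery falls within the time window. The function name `near_coincident_fraction` and the timestamps are hypothetical and are not data from the paper.

```python
from typing import Sequence


def near_coincident_fraction(
    discovery_times_us: Sequence[float], window_us: float = 50.0
) -> float:
    """Return the fraction of discoveries that have at least one other
    discovery within window_us microseconds (assumed definition)."""
    times = sorted(discovery_times_us)
    n = len(times)
    if n == 0:
        return 0.0
    flagged = 0
    for i, t in enumerate(times):
        # After sorting, only the immediate neighbours need checking.
        near_prev = i > 0 and t - times[i - 1] <= window_us
        near_next = i < n - 1 and times[i + 1] - t <= window_us
        if near_prev or near_next:
            flagged += 1
    return flagged / n


# Hypothetical discovery timestamps in microseconds (illustration only).
example = [0.0, 30.0, 1_000.0, 5_000.0, 5_040.0, 9_000.0, 20_000.0, 40_000.0]
fraction = near_coincident_fraction(example)  # window defaults to 50 microseconds
```

With these made-up timestamps, two pairs of discoveries fall within 50 μs of each other, so four of the eight discoveries are flagged. The paper's own measurement procedure may define the window differently.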

Original language: English (US)
Pages (from-to): 404-409
Number of pages: 6
Journal: Proceedings of the International Conference on Parallel Processing
Volume: 1
State: Published - 1988

ASJC Scopus subject areas

  • Hardware and Architecture
