Abstract
A methodology to study multiple latent errors and near-coincident fault discovery in the memory of a shared-memory multiprocessor is presented. The delay between the generation of an error due to a fault and its detection (error latency) can cause multiple latent errors and near-coincident fault discovery in a system. The latter effect is widely known to be catastrophic to the continued operation of a system, even in highly fault-tolerant systems. The methodology is illustrated on the Alliant FX/8 under real concurrent workload conditions over a five-day period. The authors found that for a conservative error rate of one error per day, one out of four errors may manifest itself as a multiple latent error. At the same error rate, 8% of the error discoveries are near-coincident in nature for a time-window size of 50 μs (approximately 250 instruction cycles). A strong correlation between the existence of multiple latent errors and their near-coincident discovery is quantified.
Original language | English (US) |
---|---|
Pages (from-to) | 404-409 |
Number of pages | 6 |
Journal | Proceedings of the International Conference on Parallel Processing |
Volume | 1 |
State | Published - 1988 |
ASJC Scopus subject areas
- Hardware and Architecture