A methodology to study multiple latent errors and near-coincident fault discovery in the memory of a shared-memory multiprocessor is presented. The delay between the generation of an error due to a fault and its detection (the error latency) can cause multiple latent errors and near-coincident fault discovery in a system. The latter effect is widely known to be catastrophic to continued operation, even in highly fault-tolerant systems. The methodology is illustrated on the Alliant FX/8 under real concurrent workload conditions over a five-day period. The authors found that, for a conservative error rate of one error per day, one out of four errors may manifest itself as a multiple latent error. At the same error rate, 8% of the error discoveries are near-coincident for a time-window size of 50 μs (approximately 250 instruction cycles). A strong correlation between the existence of multiple latent errors and their near-coincident discovery is quantified.
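As an illustration of the near-coincident measure mentioned above, the following minimal sketch counts the fraction of error discoveries that fall within a fixed time window of another discovery. It is not the authors' method: the function name `near_coincident_fraction`, the input format (discovery timestamps in microseconds), and the rule that a discovery is near-coincident when any neighbouring discovery lies within the window are all assumptions made for this example.

```python
from typing import Sequence


def near_coincident_fraction(discovery_times_us: Sequence[float],
                             window_us: float = 50.0) -> float:
    """Return the fraction of discoveries with another discovery within window_us.

    Assumption (not from the paper): a discovery counts as near-coincident
    if its nearest earlier or later discovery is at most window_us away.
    """
    times = sorted(discovery_times_us)
    n = len(times)
    if n == 0:
        return 0.0

    flagged = 0
    for i, t in enumerate(times):
        # Check the immediate neighbours in time order; they are the closest.
        prev_close = i > 0 and (t - times[i - 1]) <= window_us
        next_close = i < n - 1 and (times[i + 1] - t) <= window_us
        if prev_close or next_close:
            flagged += 1
    return flagged / n


if __name__ == "__main__":
    # Hypothetical timestamps in microseconds, for illustration only.
    sample = [0.0, 30.0, 1000.0, 5000.0, 5040.0, 9000.0]
    print(near_coincident_fraction(sample, window_us=50.0))
```

Only the immediate neighbours in sorted order need to be checked, since any other discovery is at least as far away in time.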
|Original language||English (US)|
|Number of pages||6|
|Journal||Proceedings of the International Conference on Parallel Processing|
|State||Published - 1988|
ASJC Scopus subject areas
- Hardware and Architecture