Measurement-based analysis of multiple latent errors and near-coincident fault discovery in a shared memory multiprocessor

S. G. Mitra, R. K. Iyer

Research output: Contribution to journal › Conference article

Abstract

A methodology to study multiple latent errors and near-coincident fault discovery in the memory of a shared-memory multiprocessor is presented. The delay between the generation of an error due to a fault and its detection (error latency) can cause multiple latent errors and near-coincident fault discovery in a system. The latter effect is widely known to be catastrophic to the continued operation of a system even in highly fault-tolerant systems. The methodology is illustrated on the Alliant FX/8 under real concurrent workload conditions over a five-day period. The authors found that for a conservative error rate of one error per day, one out of four errors may manifest itself as a multiple latent error. At the same error rate, 8% of the error discoveries are near-coincident in nature for a time-window size of 50 μs (approximately 250 instruction cycles). A strong correlation between the existence of multiple latent errors and their near-coincident discovery is quantified.
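
The near-coincidence measure in the abstract can be illustrated with a small sketch. It is not the authors' method: the abstract does not give their exact definition, so the code assumes that a discovery counts as near-coincident when at least one other discovery falls within the time window. The function names and the sample timestamps are hypothetical. The second helper only restates the abstract's own approximation (50 μs for roughly 250 instruction cycles) as an implied time per cycle.

```python
from typing import Sequence


def near_coincident_fraction(discovery_times_us: Sequence[float],
                             window_us: float = 50.0) -> float:
    """Fraction of discoveries with another discovery within window_us.

    Assumed definition (not taken from the paper): a discovery is
    near-coincident if its nearest neighbour in time is at most
    window_us microseconds away.
    """
    times = sorted(discovery_times_us)
    if not times:
        return 0.0
    flagged = 0
    for i, t in enumerate(times):
        near_prev = i > 0 and t - times[i - 1] <= window_us
        near_next = i + 1 < len(times) and times[i + 1] - t <= window_us
        if near_prev or near_next:
            flagged += 1
    return flagged / len(times)


def implied_cycle_time_ns(window_us: float = 50.0, cycles: int = 250) -> float:
    """Time per instruction cycle implied by 'window_us is about cycles cycles'."""
    return window_us * 1000.0 / cycles


if __name__ == "__main__":
    # Hypothetical discovery timestamps in microseconds, for illustration only.
    sample = [0.0, 30.0, 1000.0, 5000.0, 5040.0, 9000.0]
    print(near_coincident_fraction(sample))
    print(implied_cycle_time_ns())
```

Under this assumed definition, a figure such as the 8% quoted above would correspond to the value returned by near_coincident_fraction for the measured discovery times at a 50 μs window.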

Original language: English (US)
Pages (from-to): 404-409
Number of pages: 6
Journal: Proceedings of the International Conference on Parallel Processing
Volume: 1
State: Published - Dec 1 1988

Fingerprint

Data storage equipment
Error detection

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

@article{ba419047288a40cab852b36af1bdad74,
title = "Measurement-based analysis of multiple latent errors and near-coincident fault discovery in a shared memory multiprocessor",
abstract = "A methodology to study multiple latent errors and near-coincident fault discovery in the memory of a shared-memory multiprocessor is presented. The delay between the generation of an error due to a fault and its detection (error latency) can cause multiple latent errors and near-coincident fault discovery in a system. The latter effect is widely known to be catastrophic to the continued operation of a system even in highly fault-tolerant systems. The methodology is illustrated on the Alliant FX/8 under real concurrent workload conditions over a five-day period. The authors found that for a conservative error rate of one error per day, one out of four errors may manifest itself as a multiple latent error. At the same error rate, 8{\%} of the error discoveries are near-coincident in nature for a time-window size of 50 μs (approximately 250 instruction cycles). A strong correlation between the existence of multiple latent errors and their near-coincident discovery is quantified.",
author = "Mitra, {S. G.} and Iyer, {R. K.}",
year = "1988",
month = "12",
day = "1",
language = "English (US)",
volume = "1",
pages = "404--409",
journal = "Proceedings of the International Conference on Parallel Processing",
issn = "0190-3918",

}

TY - JOUR

T1 - Measurement-based analysis of multiple latent errors and near-coincident fault discovery in a shared memory multiprocessor

AU - Mitra, S. G.

AU - Iyer, R. K.

PY - 1988/12/1

Y1 - 1988/12/1

N2 - A methodology to study multiple latent errors and near-coincident fault discovery in the memory of a shared-memory multiprocessor is presented. The delay between the generation of an error due to a fault and its detection (error latency) can cause multiple latent errors and near-coincident fault discovery in a system. The latter effect is widely known to be catastrophic to the continued operation of a system even in highly fault-tolerant systems. The methodology is illustrated on the Alliant FX/8 under real concurrent workload conditions over a five-day period. The authors found that for a conservative error rate of one error per day, one out of four errors may manifest itself as a multiple latent error. At the same error rate, 8% of the error discoveries are near-coincident in nature for a time-window size of 50 μs (approximately 250 instruction cycles). A strong correlation between the existence of multiple latent errors and their near-coincident discovery is quantified.

AB - A methodology to study multiple latent errors and near-coincident fault discovery in the memory of a shared-memory multiprocessor is presented. The delay between the generation of an error due to a fault and its detection (error latency) can cause multiple latent errors and near-coincident fault discovery in a system. The latter effect is widely known to be catastrophic to the continued operation of a system even in highly fault-tolerant systems. The methodology is illustrated on the Alliant FX/8 under real concurrent workload conditions over a five-day period. The authors found that for a conservative error rate of one error per day, one out of four errors may manifest itself as a multiple latent error. At the same error rate, 8% of the error discoveries are near-coincident in nature for a time-window size of 50 μs (approximately 250 instruction cycles). A strong correlation between the existence of multiple latent errors and their near-coincident discovery is quantified.

UR - http://www.scopus.com/inward/record.url?scp=0024131169&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=0024131169&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:0024131169

VL - 1

SP - 404

EP - 409

JO - Proceedings of the International Conference on Parallel Processing

JF - Proceedings of the International Conference on Parallel Processing

SN - 0190-3918

ER -