Accurate microarchitecture-level fault modeling for studying hardware faults

Man Lap Li, Pradeep Ramachandran, Ulya R. Karpuzcu, Siva Kumar Sastry Hari, Sarita V. Adve

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Decreasing hardware reliability is expected to impede the exploitation of increasing integration projected by Moore's Law. There is much ongoing research on efficient fault tolerance mechanisms across all levels of the system stack, from the device level to the system level. High-level fault tolerance solutions, such as at the microarchitecture and system levels, are commonly evaluated using statistical fault injections with microarchitecture-level fault models. Since hardware faults actually manifest at a much lower level, it is unclear if such high-level fault models are acceptably accurate. On the other hand, lower-level models, such as at the gate level, may be more accurate, but their increased simulation times make it hard to track the system-level propagation of faults. Thus, an evaluation of high-level reliability solutions entails the classical tradeoff between speed and accuracy. This paper seeks to quantify and alleviate this tradeoff. We make the following contributions: (1) We introduce SWAT-Sim, a novel fault injection infrastructure that uses hierarchical simulation to study the system-level manifestations of permanent (and transient) gate-level faults. For our experiments, SWAT-Sim incurs a small average performance overhead of under 3x, for the components we simulate, when compared to pure microarchitectural simulations. (2) We study system-level manifestations of faults injected under different microarchitecture-level and gate-level fault models and identify the reasons for the inability of microarchitecture-level faults to model gate-level faults in general. (3) Based on our analysis, we derive two probabilistic microarchitecture-level fault models to mimic gate-level stuck-at and delay faults. Our results show that these models are, in general, inaccurate as they do not capture the complex manifestation of gate-level faults. The inaccuracies in existing models and the lack of more accurate microarchitecture-level models motivate using infrastructures similar to SWAT-Sim to faithfully model the microarchitecture-level effects of gate-level faults.
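
As a rough illustration of why a microarchitecture-level fault model can miss gate-level behavior, the following Python sketch injects a stuck-at fault on one internal net of a 4-bit ripple-carry adder and reports which output bits it can corrupt. The adder, the fault site (the `a AND b` net of one full adder), and all function names are assumptions chosen for illustration; this is not SWAT-Sim and does not reproduce the paper's experiments.

```python
from itertools import product


def full_adder(a, b, cin, stuck=None):
    """One-bit full adder built from gates.

    `stuck` (hypothetical fault site) forces the internal a-AND-b net
    to a fixed value, modeling a gate-level stuck-at fault.
    """
    axb = a ^ b
    gen = a & b
    if stuck is not None:
        gen = stuck  # the faulty net ignores its inputs
    s = axb ^ cin
    cout = gen | (axb & cin)
    return s, cout


def ripple_add(x, y, width=4, fault_bit=None, stuck=None):
    """Add two `width`-bit values; optionally fault one full adder."""
    carry, result = 0, 0
    for i in range(width):
        a, b = (x >> i) & 1, (y >> i) & 1
        s, carry = full_adder(a, b, carry, stuck if i == fault_bit else None)
        result |= s << i
    return result | (carry << width)  # carry-out becomes the top bit


def corrupted_bits(width=4, fault_bit=1, stuck=1):
    """Output bit positions that differ from the fault-free sum
    for at least one input pair (exhaustive over all inputs)."""
    seen = set()
    for x, y in product(range(1 << width), repeat=2):
        good = ripple_add(x, y, width)
        bad = ripple_add(x, y, width, fault_bit, stuck)
        diff = good ^ bad
        seen.update(i for i in range(width + 1) if (diff >> i) & 1)
    return seen


if __name__ == "__main__":
    print(sorted(corrupted_bits()))
    print(ripple_add(3, 3), ripple_add(3, 3, fault_bit=1, stuck=1))
```

In this toy circuit, a single stuck-at-1 net in the bit-1 full adder never disturbs output bits 0 or 1, can corrupt bits 2 through 4 depending on the operands, and is masked entirely for inputs such as 3 + 3. A model that simply pins one output bit to a constant would not describe that input-dependent, multi-bit behavior.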

Original language: English (US)
Title of host publication: Proceedings - 15th International Symposium on High-Performance Computer Architecture, HPCA - 15 2009
Pages: 105-116
Number of pages: 12
DOIs: https://doi.org/10.1109/HPCA.2009.4798242
State: Published - Apr 24 2009
Event: 2008 IEEE International Conference on Mechatronics and Automation, ICMA 2008 - Takamatsu, Japan
Duration: Aug 5 2008 to Aug 8 2008

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
ISSN (Print): 1530-0897

Other

Other: 2008 IEEE International Conference on Mechatronics and Automation, ICMA 2008
Country: Japan
City: Takamatsu
Period: 8/5/08 to 8/8/08

Fingerprint

Hardware
Fault tolerance
Experiments

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Li, M. L., Ramachandran, P., Karpuzcu, U. R., Hari, S. K. S., & Adve, S. V. (2009). Accurate microarchitecture-level fault modeling for studying hardware faults. In Proceedings - 15th International Symposium on High-Performance Computer Architecture, HPCA - 15 2009 (pp. 105-116). [4798242] (Proceedings - International Symposium on High-Performance Computer Architecture). https://doi.org/10.1109/HPCA.2009.4798242

@inproceedings{a60f121a491b45dc9df7226bef545e9e,
title = "Accurate microarchitecture-level fault modeling for studying hardware faults",
author = "Li, {Man Lap} and Pradeep Ramachandran and Karpuzcu, {Ulya R.} and Hari, {Siva Kumar Sastry} and Adve, {Sarita V.}",
year = "2009",
month = "4",
day = "24",
doi = "10.1109/HPCA.2009.4798242",
language = "English (US)",
isbn = "9781424429325",
series = "Proceedings - International Symposium on High-Performance Computer Architecture",
pages = "105--116",
booktitle = "Proceedings - 15th International Symposium on High-Performance Computer Architecture, HPCA - 15 2009",

}

TY - GEN

T1 - Accurate microarchitecture-level fault modeling for studying hardware faults

AU - Li, Man Lap

AU - Ramachandran, Pradeep

AU - Karpuzcu, Ulya R.

AU - Hari, Siva Kumar Sastry

AU - Adve, Sarita V.

PY - 2009/4/24

Y1 - 2009/4/24

UR - http://www.scopus.com/inward/record.url?scp=64949105166&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=64949105166&partnerID=8YFLogxK

U2 - 10.1109/HPCA.2009.4798242

DO - 10.1109/HPCA.2009.4798242

M3 - Conference contribution

AN - SCOPUS:64949105166

SN - 9781424429325

T3 - Proceedings - International Symposium on High-Performance Computer Architecture

SP - 105

EP - 116

BT - Proceedings - 15th International Symposium on High-Performance Computer Architecture, HPCA - 15 2009

ER -