Trace-based microarchitecture-level diagnosis of permanent hardware faults

Man Lap Li, Pradeep Ramachandran, Swarup K. Sahoo, Sarita V. Adve, Vikram S. Adve, Yuanyuan Zhou

Research output: Contribution to conference (Paper)

Abstract

As devices continue to scale, future shipped hardware will likely fail due to in-the-field hardware faults. Because traditional redundancy-based hardware reliability solutions that tackle these faults will be too expensive to deploy broadly, recent research has focused on low-overhead reliability solutions. One approach is to employ low-overhead ("always-on") detection techniques that catch high-level symptoms and to pay a higher overhead for (rarely invoked) diagnosis. This paper presents trace-based fault diagnosis, a diagnosis strategy that identifies permanent faults in microarchitectural units by analyzing the faulty core's instruction trace. Once a fault is detected, the faulty core is rolled back and re-executes from a previous checkpoint, generating a faulty instruction trace and recording the microarchitecture-level resource usage. A diagnosis process on another, fault-free core then generates a fault-free trace, which it compares with the faulty trace to identify the faulty unit. Our results show that this approach successfully diagnoses 98% of the faults studied and is a highly robust and flexible way to diagnose permanent faults.
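The comparison step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `TraceEntry` record, the unit names, and the "count the units used by mismatching instructions" heuristic are all assumptions made for illustration, and the sketch ignores how an erroneous value propagates to later instructions.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class TraceEntry:
    """One retired instruction from the faulty core's re-execution (hypothetical layout)."""
    pc: int                      # program counter of the instruction
    result: int                  # value the faulty core produced
    resources: Tuple[str, ...]   # microarchitectural units the instruction used


def diagnose(faulty_trace: Sequence[TraceEntry],
             golden_results: Sequence[int]) -> List[Tuple[str, int]]:
    """Rank units by how many mismatching instructions used them.

    golden_results holds the values produced by the fault-free core for the
    same instruction sequence. zip() stops at the shorter of the two traces.
    """
    suspects: Counter = Counter()
    for entry, golden in zip(faulty_trace, golden_results):
        if entry.result != golden:
            # Every unit this instruction touched becomes a suspect.
            suspects.update(entry.resources)
    return suspects.most_common()


# Illustrative data only: two of four instructions disagree, and both used "alu1".
faulty = [
    TraceEntry(0x100, 5, ("alu0", "reg3")),
    TraceEntry(0x104, 9, ("alu1", "reg4")),
    TraceEntry(0x108, 7, ("alu1", "reg5")),
    TraceEntry(0x10C, 2, ("alu0", "reg6")),
]
golden = [5, 8, 6, 2]
ranking = diagnose(faulty, golden)
print(ranking[0])
```

In this toy run the unit shared by both mismatching instructions ranks first; a real diagnosis would also need to separate the original error from values corrupted downstream of it.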

Original language: English (US)
Pages: 22-31
Number of pages: 10
DOI: https://doi.org/10.1109/DSN.2008.4630067
State: Published - Oct 13 2008
Event: 2008 International Conference on Dependable Systems and Networks, DSN-2008 - Anchorage, AK, United States
Duration: Jun 24 2008 to Jun 27 2008

Other

Other: 2008 International Conference on Dependable Systems and Networks, DSN-2008
Country: United States
City: Anchorage, AK
Period: 6/24/08 to 6/27/08

Fingerprint

Hardware
Failure analysis
Redundancy

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Li, M. L., Ramachandran, P., Sahoo, S. K., Adve, S. V., Adve, V. S., & Zhou, Y. (2008). Trace-based microarchitecture-level diagnosis of permanent hardware faults. 22-31. Paper presented at 2008 International Conference on Dependable Systems and Networks, DSN-2008, Anchorage, AK, United States. https://doi.org/10.1109/DSN.2008.4630067

Trace-based microarchitecture-level diagnosis of permanent hardware faults. / Li, Man Lap; Ramachandran, Pradeep; Sahoo, Swarup K.; Adve, Sarita V.; Adve, Vikram S.; Zhou, Yuanyuan.

2008. 22-31 Paper presented at 2008 International Conference on Dependable Systems and Networks, DSN-2008, Anchorage, AK, United States.

Li, ML, Ramachandran, P, Sahoo, SK, Adve, SV, Adve, VS & Zhou, Y 2008, 'Trace-based microarchitecture-level diagnosis of permanent hardware faults', Paper presented at 2008 International Conference on Dependable Systems and Networks, DSN-2008, Anchorage, AK, United States, 6/24/08 - 6/27/08 pp. 22-31. https://doi.org/10.1109/DSN.2008.4630067
Li ML, Ramachandran P, Sahoo SK, Adve SV, Adve VS, Zhou Y. Trace-based microarchitecture-level diagnosis of permanent hardware faults. 2008. Paper presented at 2008 International Conference on Dependable Systems and Networks, DSN-2008, Anchorage, AK, United States. https://doi.org/10.1109/DSN.2008.4630067
Li, Man Lap ; Ramachandran, Pradeep ; Sahoo, Swarup K. ; Adve, Sarita V. ; Adve, Vikram S. ; Zhou, Yuanyuan. / Trace-based microarchitecture-level diagnosis of permanent hardware faults. Paper presented at 2008 International Conference on Dependable Systems and Networks, DSN-2008, Anchorage, AK, United States. 10 p.
@conference{a966e4d942f44f3dad50e68d4af6bf5b,
title = "Trace-based microarchitecture-level diagnosis of permanent hardware faults",
abstract = "As devices continue to scale, future shipped hardware will likely fail due to in-the-field hardware faults. As traditional redundancy-based hardware reliability solutions that tackle these faults will be too expensive to be broadly deployable, recent research has focused on low-overhead reliability solutions. One approach is to employ low-overhead ({"}always-on{"}) detection techniques that catch high-level symptoms and pay a higher overhead for (rarely invoked) diagnosis. This paper presents trace-based fault diagnosis, a diagnosis strategy that identifies permanent faults in microarchitectural units by analyzing the faulty core's instruction trace. Once a fault is detected, the faulty core is rolled back and re-executes from a previous checkpoint, generating a faulty instruction trace and recording the microarchitecture-level resource usage. A diagnosis process on another fault-free core then generates a fault-free trace which it compares with the faulty trace to identify the faulty unit. Our result shows that this approach successfully diagnoses 98{\%} of the faults studied and is a highly robust and flexible way for diagnosing permanent faults.",
author = "Li, {Man Lap} and Pradeep Ramachandran and Sahoo, {Swarup K.} and Adve, {Sarita V.} and Adve, {Vikram S.} and Yuanyuan Zhou",
year = "2008",
month = "10",
day = "13",
doi = "10.1109/DSN.2008.4630067",
language = "English (US)",
pages = "22--31",
note = "2008 International Conference on Dependable Systems and Networks, DSN-2008 ; Conference date: 24-06-2008 Through 27-06-2008",

}

TY - CONF

T1 - Trace-based microarchitecture-level diagnosis of permanent hardware faults

AU - Li, Man Lap

AU - Ramachandran, Pradeep

AU - Sahoo, Swarup K.

AU - Adve, Sarita V.

AU - Adve, Vikram S.

AU - Zhou, Yuanyuan

PY - 2008/10/13

Y1 - 2008/10/13

N2 - As devices continue to scale, future shipped hardware will likely fail due to in-the-field hardware faults. As traditional redundancy-based hardware reliability solutions that tackle these faults will be too expensive to be broadly deployable, recent research has focused on low-overhead reliability solutions. One approach is to employ low-overhead ("always-on") detection techniques that catch high-level symptoms and pay a higher overhead for (rarely invoked) diagnosis. This paper presents trace-based fault diagnosis, a diagnosis strategy that identifies permanent faults in microarchitectural units by analyzing the faulty core's instruction trace. Once a fault is detected, the faulty core is rolled back and re-executes from a previous checkpoint, generating a faulty instruction trace and recording the microarchitecture-level resource usage. A diagnosis process on another fault-free core then generates a fault-free trace which it compares with the faulty trace to identify the faulty unit. Our result shows that this approach successfully diagnoses 98% of the faults studied and is a highly robust and flexible way for diagnosing permanent faults.

UR - http://www.scopus.com/inward/record.url?scp=53349142162&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=53349142162&partnerID=8YFLogxK

U2 - 10.1109/DSN.2008.4630067

DO - 10.1109/DSN.2008.4630067

M3 - Paper

AN - SCOPUS:53349142162

SP - 22

EP - 31

ER -