Abstract
As devices continue to scale, future shipped hardware will likely fail due to in-the-field hardware faults. Because traditional redundancy-based hardware reliability solutions that tackle these faults will be too expensive to be broadly deployable, recent research has focused on low-overhead reliability solutions. One approach is to employ low-overhead ("always-on") detection techniques that catch high-level symptoms and to pay a higher overhead only for (rarely invoked) diagnosis. This paper presents trace-based fault diagnosis, a diagnosis strategy that identifies permanent faults in microarchitectural units by analyzing the faulty core's instruction trace. Once a fault is detected, the faulty core is rolled back and re-executes from a previous checkpoint, generating a faulty instruction trace and recording the microarchitecture-level resource usage. A diagnosis process on another, fault-free core then generates a fault-free trace, which it compares with the faulty trace to identify the faulty unit. Our results show that this approach successfully diagnoses 98% of the faults studied and is a highly robust and flexible way to diagnose permanent faults.
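The trace comparison described in the abstract can be illustrated with a minimal sketch. This is not the paper's algorithm: the `TraceEntry` record, the unit names, and the "count the units used by mismatching instructions" heuristic are all assumptions made for illustration, and the abstract does not specify how the real diagnosis ranks candidate units.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TraceEntry:
    """One retired instruction (hypothetical record layout)."""
    pc: int            # program counter of the instruction
    result: int        # value the instruction produced
    units: frozenset   # microarchitectural units it used, e.g. {"alu0"}


def diagnose(faulty_trace, golden_trace):
    """Rank units by how many mismatching instructions used them.

    Assumed heuristic: an instruction whose result differs from the
    fault-free (golden) trace casts one vote for every unit it used.
    Comparison stops once the program counters diverge, because later
    entries no longer describe the same instruction.
    """
    suspects = Counter()
    for faulty, golden in zip(faulty_trace, golden_trace):
        if faulty.pc != golden.pc:
            break
        if faulty.result != golden.result:
            suspects.update(faulty.units)
    # List of (unit, votes) pairs, most suspicious unit first.
    return suspects.most_common()


if __name__ == "__main__":
    golden = [
        TraceEntry(0x100, 5, frozenset({"alu0", "reg3"})),
        TraceEntry(0x104, 7, frozenset({"alu1", "reg4"})),
        TraceEntry(0x108, 9, frozenset({"alu0", "reg5"})),
    ]
    # Same instructions, but the two that used "alu0" produced wrong values.
    faulty = [
        TraceEntry(0x100, 4, frozenset({"alu0", "reg3"})),
        TraceEntry(0x104, 7, frozenset({"alu1", "reg4"})),
        TraceEntry(0x108, 8, frozenset({"alu0", "reg5"})),
    ]
    print(diagnose(faulty, golden))
```

In this toy input, `alu0` is the only unit shared by both mismatching instructions, so it receives the most votes. A real diagnosis would also need to handle control-flow divergence and faults that corrupt the recorded resource usage itself, which this sketch ignores.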
Original language | English (US) |
---|---|
Pages | 22-31 |
Number of pages | 10 |
State | Published - 2008 |
Event | 2008 International Conference on Dependable Systems and Networks, DSN-2008 - Anchorage, AK, United States; Duration: Jun 24 2008 → Jun 27 2008 |
Other
Other | 2008 International Conference on Dependable Systems and Networks, DSN-2008 |
---|---|
Country/Territory | United States |
City | Anchorage, AK |
Period | 6/24/08 → 6/27/08 |
ASJC Scopus subject areas
- Software
- Hardware and Architecture
- Computer Networks and Communications