TY - GEN
T1 - Relyzer
T2 - 17th International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2012
AU - Hari, Siva Kumar Sastry
AU - Adve, Sarita V.
AU - Naeimi, Helia
AU - Ramachandran, Pradeep
PY - 2012
Y1 - 2012
N2 - Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but there remains a non-negligible risk that several faults may escape the symptom detectors and result in silent data corruptions (SDCs). Most prior evaluations of symptom-based detectors perform fault injection campaigns on application benchmarks, where each run simulates the impact of a fault injected at a hardware site at a certain point in the application's execution (application fault site). Since the total number of application fault sites is very large (trillions for standard benchmark suites), it is not feasible to study all possible faults. Previous work therefore typically studies a randomly selected sample of faults. Such studies do not provide any feedback on the portions of the application where faults were not injected. Some of those instructions may be vulnerable to SDCs, and identifying them could allow protecting them through other means if needed. This paper presents Relyzer, an approach that systematically analyzes all application fault sites and carefully picks a small subset to perform selective fault injections for transient faults. Relyzer employs novel fault pruning techniques that prune faults that need detailed study by either predicting their outcomes or showing them equivalent to other faults. We find that Relyzer prunes about 99.78% of the total faults across twelve applications studied here, reducing the faults that require detailed simulation by 3 to 5 orders of magnitude for most of the applications. Fault injection simulations on the remaining faults can identify SDC causing faults in the entire application. Some of Relyzer's techniques rely on heuristics to determine fault equivalence. Our validation efforts show that Relyzer determines fault outcomes with 96% accuracy, averaged across all the applications studied here.
AB - Future microprocessors need low-cost solutions for reliable operation in the presence of failure-prone devices. A promising approach is to detect hardware faults by deploying low-cost monitors of software-level symptoms of such faults. Recently, researchers have shown these mechanisms work well, but there remains a non-negligible risk that several faults may escape the symptom detectors and result in silent data corruptions (SDCs). Most prior evaluations of symptom-based detectors perform fault injection campaigns on application benchmarks, where each run simulates the impact of a fault injected at a hardware site at a certain point in the application's execution (application fault site). Since the total number of application fault sites is very large (trillions for standard benchmark suites), it is not feasible to study all possible faults. Previous work therefore typically studies a randomly selected sample of faults. Such studies do not provide any feedback on the portions of the application where faults were not injected. Some of those instructions may be vulnerable to SDCs, and identifying them could allow protecting them through other means if needed. This paper presents Relyzer, an approach that systematically analyzes all application fault sites and carefully picks a small subset to perform selective fault injections for transient faults. Relyzer employs novel fault pruning techniques that prune faults that need detailed study by either predicting their outcomes or showing them equivalent to other faults. We find that Relyzer prunes about 99.78% of the total faults across twelve applications studied here, reducing the faults that require detailed simulation by 3 to 5 orders of magnitude for most of the applications. Fault injection simulations on the remaining faults can identify SDC causing faults in the entire application. Some of Relyzer's techniques rely on heuristics to determine fault equivalence. Our validation efforts show that Relyzer determines fault outcomes with 96% accuracy, averaged across all the applications studied here.
KW - architecture
KW - hardware reliability evaluation
KW - low-cost hardware resiliency
KW - silent data corruption
KW - transient faults
UR - http://www.scopus.com/inward/record.url?scp=84858759524&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84858759524&partnerID=8YFLogxK
U2 - 10.1145/2150976.2150990
DO - 10.1145/2150976.2150990
M3 - Conference contribution
AN - SCOPUS:84858759524
SN - 9781450307598
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 123
EP - 134
BT - ASPLOS XVII - 17th International Conference on Architectural Support for Programming Languages and Operating Systems
Y2 - 3 March 2012 through 7 March 2012
ER -