TY - GEN
T1 - Low-cost program-level detectors for reducing silent data corruptions
AU - Hari, Siva Kumar Sastry
AU - Adve, Sarita V.
AU - Naeimi, Helia
PY - 2012
Y1 - 2012
N2 - With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques are relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost-effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are, respectively, 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.
AB - With technology scaling, transient faults are becoming an increasing threat to hardware reliability. Commodity systems must be made resilient to these in-field faults through very low-cost resiliency solutions. Software-level symptom detection techniques have emerged as promising low-cost and effective solutions. While the current user-visible Silent Data Corruption (SDC) rates for these techniques are relatively low, eliminating or significantly lowering the SDC rate is crucial for these solutions to become practically successful. Identifying and understanding program sections that cause SDCs is crucial to reducing (or eliminating) SDCs in a cost-effective manner. This paper provides a detailed analysis of code sections that produce over 90% of SDCs for six applications we studied. This analysis facilitated the development of program-level detectors that catch errors in quantities that are either accumulated or active for a long duration, amortizing the detection costs. These low-cost detectors significantly reduce the dependency on redundancy-based techniques and provide more practical and flexible choice points on the performance vs. reliability trade-off curve. For example, for an average of 90%, 99%, or 100% reduction of the baseline SDC rate, the average execution overheads of our approach versus redundancy alone are, respectively, 12% vs. 30%, 19% vs. 43%, and 27% vs. 51%.
KW - Application resiliency
KW - Hardware reliability
KW - Silent data corruptions
KW - Symptom-based fault detection
KW - Transient faults
UR - http://www.scopus.com/inward/record.url?scp=84866653671&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866653671&partnerID=8YFLogxK
U2 - 10.1109/DSN.2012.6263960
DO - 10.1109/DSN.2012.6263960
M3 - Conference contribution
AN - SCOPUS:84866653671
SN - 9781467316248
T3 - Proceedings of the International Conference on Dependable Systems and Networks
BT - 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012
T2 - 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012
Y2 - 25 June 2012 through 28 June 2012
ER -