TY - GEN
T1 - Quantitative analysis of long latency failures in system software
AU - Yim, Keun Soo
AU - Kalbarczyk, Zbigniew T
AU - Iyer, Ravishankar K
PY - 2009
Y1 - 2009
N2 - This paper presents a study of long latency failures using accelerated fault injection. The data collected from the experiments are used to analyze the significance, causes, and characteristics of long latency failures caused by soft errors in the processor and the memory. The results indicate that a non-negligible portion of soft errors in the code and data memory lead to long latency failures. These failures are caused by errors with long fault activation times and by errors that cause failures only under certain runtime conditions. In contrast, less than 0.5% of soft errors in the processor registers used in kernel mode lead to a failure with latency longer than a thousand seconds, owing to the strong temporal locality of the register values. The study also shows that the obtained insight can be used to guide the design and placement (in the application code and/or system) of application-specific error detectors.
AB - This paper presents a study of long latency failures using accelerated fault injection. The data collected from the experiments are used to analyze the significance, causes, and characteristics of long latency failures caused by soft errors in the processor and the memory. The results indicate that a non-negligible portion of soft errors in the code and data memory lead to long latency failures. These failures are caused by errors with long fault activation times and by errors that cause failures only under certain runtime conditions. In contrast, less than 0.5% of soft errors in the processor registers used in kernel mode lead to a failure with latency longer than a thousand seconds, owing to the strong temporal locality of the register values. The study also shows that the obtained insight can be used to guide the design and placement (in the application code and/or system) of application-specific error detectors.
KW - Accelerated fault injection
KW - Long latency failures
KW - Operating system robustness testing
UR - http://www.scopus.com/inward/record.url?scp=77649290972&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77649290972&partnerID=8YFLogxK
U2 - 10.1109/PRDC.2009.13
DO - 10.1109/PRDC.2009.13
M3 - Conference contribution
AN - SCOPUS:77649290972
SN - 9780769538495
T3 - 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2009
SP - 23
EP - 30
BT - 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2009
T2 - 2009 15th IEEE Pacific Rim International Symposium on Dependable Computing, PRDC 2009
Y2 - 16 November 2009 through 18 November 2009
ER -