TY - GEN
T1 - Module prototype for online failure prediction for the IBM Blue Gene/L
AU - Solano-Quinde, Lizandro D.
AU - Bode, Brett M.
PY - 2008
Y1 - 2008
N2 - The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold in excess of 200K processors and has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure significantly reduces the performance of the system. Although reactive fault-tolerant policies effectively minimize the effects of faults, it has been shown that these techniques drastically reduce system performance. Proactive fault-tolerant policies have emerged as an alternative because they impose less performance degradation. These policies are based on the analysis of information about the state of the system. The monitoring system of the IBM Blue Gene/L generates online information about the state of the system's hardware and software and stores that information in the RAS event log. In this study, we design and implement a module prototype for online failure prediction. The prototype is tested and validated in a realistic scenario using the RAS event log of an IBM Blue Gene/L system. We show that our module prototype predicts up to 70% of the fatal events.
AB - The growing complexity of scientific applications has led to the design and deployment of large-scale parallel systems. The IBM Blue Gene/L can hold in excess of 200K processors and has been designed for high performance and reliability. However, failures in this large-scale parallel system are a major concern, since it has been demonstrated that a failure significantly reduces the performance of the system. Although reactive fault-tolerant policies effectively minimize the effects of faults, it has been shown that these techniques drastically reduce system performance. Proactive fault-tolerant policies have emerged as an alternative because they impose less performance degradation. These policies are based on the analysis of information about the state of the system. The monitoring system of the IBM Blue Gene/L generates online information about the state of the system's hardware and software and stores that information in the RAS event log. In this study, we design and implement a module prototype for online failure prediction. The prototype is tested and validated in a realistic scenario using the RAS event log of an IBM Blue Gene/L system. We show that our module prototype predicts up to 70% of the fatal events.
KW - Blue Gene/L
KW - Computer Fault Tolerance
KW - Failure analysis
KW - Software fault tolerance
UR - http://www.scopus.com/inward/record.url?scp=51349083735&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=51349083735&partnerID=8YFLogxK
U2 - 10.1109/EIT.2008.4554349
DO - 10.1109/EIT.2008.4554349
M3 - Conference contribution
AN - SCOPUS:51349083735
SN - 9781424420308
T3 - 2008 IEEE International Conference on Electro/Information Technology, IEEE EIT 2008 Conference
SP - 470
EP - 474
BT - 2008 IEEE International Conference on Electro/Information Technology, IEEE EIT 2008 Conference
T2 - 2008 IEEE International Conference on Electro/Information Technology, IEEE EIT 2008 Conference
Y2 - 18 May 2008 through 20 May 2008
ER -