TY - GEN
T1 - Lessons learned from the analysis of system failures at petascale
T2 - 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2014
AU - Di Martino, Catello
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar K.
AU - Baccanico, Fabio
AU - Fullop, Joseph
AU - Kramer, William
N1 - Publisher Copyright: © 2014 IEEE.
PY - 2014/9/18
Y1 - 2014/9/18
AB - This paper provides an analysis of failures and their impact for Blue Waters, the Cray hybrid (CPU/GPU) supercomputer at the University of Illinois at Urbana-Champaign. The analysis is based on both manual failure reports and automatically generated event logs collected over 261 days. Results include (i) a characterization of the root causes of single-node failures, (ii) a direct assessment of the effectiveness of system-level failover as well as memory, processor, network, GPU accelerator, and file system error resiliency, and (iii) an analysis of system-wide outages. The major findings of this study are as follows. Hardware is not the main cause of system downtime, even though hardware-related failures account for 42% of all failures: failures caused by hardware were responsible for only 23% of the total repair time. These results are partially due to the fact that processor and memory protection mechanisms (x8 and x4 Chipkill, ECC, and parity) are able to handle a sustained rate of errors as high as 250 errors/h while providing a coverage of 99.997% over a set of more than 1.5 million analyzed errors. Only 28 multiple-bit errors bypassed the employed protection mechanisms. Software, on the other hand, was the largest contributor to the node repair hours (53%), despite being the cause of only 20% of the total number of failures. A total of 29 out of 39 system-wide outages involved the Lustre file system, with 42% of them caused by the inadequacy of the automated failover procedures.
KW - Cray XE6
KW - Cray XK7
KW - Failure Analysis
KW - Failure Reports
KW - Machine Check
KW - Nvidia GPU errors
KW - Supercomputer
UR - http://www.scopus.com/inward/record.url?scp=84912075762&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84912075762&partnerID=8YFLogxK
U2 - 10.1109/DSN.2014.62
DO - 10.1109/DSN.2014.62
M3 - Conference contribution
AN - SCOPUS:84912075762
T3 - Proceedings of the International Conference on Dependable Systems and Networks
SP - 610
EP - 621
BT - Proceedings of the International Conference on Dependable Systems and Networks
PB - IEEE Computer Society
Y2 - 23 June 2014 through 26 June 2014
ER -