TY - GEN
T1 - Measuring and Understanding Extreme-Scale Application Resilience
T2 - 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015
AU - Di Martino, Catello
AU - Kramer, William
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar
N1 - Publisher Copyright: © 2015 IEEE.
PY - 2015/9/14
Y1 - 2015/9/14
N2 - This paper presents an in-depth characterization of the resiliency of more than 5 million HPC application runs completed during the first 518 production days of Blue Waters, a 13.1 petaflop Cray hybrid supercomputer. Unlike past work, we measure the impact of system errors and failures on user applications, i.e., the compiled programs launched by user jobs that can execute across one or more XE (CPU) or XK (CPU+GPU) nodes. The characterization is performed by means of a joint analysis of several data sources, which include workload and error/failure logs. In order to relate system errors and failures to the executed applications, we developed LogDiver, a tool to automate the data pre-processing and metric computation. Some of the lessons learned in this study include: i) while about 1.53% of applications fail due to system problems, the failed applications contribute to about 9% of the production node hours executed in the measured period, i.e., the system consumes computing resources, and system-related issues represent a potentially significant energy cost for the work lost, ii) there is a dramatic increase in the application failure probability when executing full-scale applications: 20x (from 0.008 to 0.162) when scaling XE applications from 10,000 to 22,000 nodes, and 6x (from 0.02 to 0.129) when scaling GPU/hybrid applications from 2000 to 4224 nodes, and iii) the resiliency of hybrid applications is impaired by the lack of adequate error detection capabilities in hybrid nodes.
KW - application resilience
KW - data analysis
KW - data-driven resilience
KW - extreme-scale
KW - hybrid machines
KW - resilience
KW - supercomputer
UR - http://www.scopus.com/inward/record.url?scp=84950112790&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84950112790&partnerID=8YFLogxK
U2 - 10.1109/DSN.2015.50
DO - 10.1109/DSN.2015.50
M3 - Conference contribution
AN - SCOPUS:84950112790
T3 - Proceedings of the International Conference on Dependable Systems and Networks
SP - 25
EP - 36
BT - Proceedings - 2015 45th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2015
PB - IEEE Computer Society
Y2 - 22 June 2015 through 25 June 2015
ER -