TY - GEN
T1 - Reducing Waste in Extreme Scale Systems through Introspective Analysis
AU - Bautista-Gomez, Leonardo
AU - Gainaru, Ana
AU - Perarnau, Swann
AU - Tiwari, Devesh
AU - Gupta, Saurabh
AU - Engelmann, Christian
AU - Cappello, Franck
AU - Snir, Marc
N1 - Publisher Copyright: © 2016 IEEE.
PY - 2016/7/18
Y1 - 2016/7/18
N2 - Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers shows that periods of higher failure density occur, with up to three times more failures than average. We design a monitoring system that listens to hardware events and forwards important events to the runtime to detect those regime changes. We implement a runtime capable of receiving notifications and adapting dynamically. In addition, we build an analytical model to predict the gains that such a dynamic approach could achieve. We demonstrate that, in some systems, our approach can reduce wasted time by over 30%.
AB - Resilience is an important challenge for extreme-scale supercomputers. Today, failures in supercomputers are assumed to be uniformly distributed in time. However, recent studies show that failures in high-performance computing systems are partially correlated in time, generating periods of higher failure density. Our study of the failure logs of multiple supercomputers shows that periods of higher failure density occur, with up to three times more failures than average. We design a monitoring system that listens to hardware events and forwards important events to the runtime to detect those regime changes. We implement a runtime capable of receiving notifications and adapting dynamically. In addition, we build an analytical model to predict the gains that such a dynamic approach could achieve. We demonstrate that, in some systems, our approach can reduce wasted time by over 30%.
KW - Fault Tolerance
KW - Introspective Systems
KW - Resilience
KW - Silent Data Corruption
KW - Soft Errors
KW - Supercomputers
UR - http://www.scopus.com/inward/record.url?scp=84983358988&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84983358988&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2016.100
DO - 10.1109/IPDPS.2016.100
M3 - Conference contribution
AN - SCOPUS:84983358988
T3 - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
SP - 212
EP - 221
BT - Proceedings - 2016 IEEE 30th International Parallel and Distributed Processing Symposium, IPDPS 2016
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 30th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2016
Y2 - 23 May 2016 through 27 May 2016
ER -