TY - GEN
T1 - LOGAIDER
T2 - 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
AU - Di, Sheng
AU - Gupta, Rinku
AU - Snir, Marc
AU - Pershey, Eric
AU - Cappello, Franck
N1 - Publisher Copyright: © 2017 IEEE.
PY - 2017/7/10
Y1 - 2017/7/10
N2 - Today's large-scale supercomputers are producing a huge amount of log data. Exploring various potential correlations of fatal events is crucial for understanding their causality and improving the working efficiency of system administrators. To this end, we developed a toolkit, named LogAider, that can reveal three types of potential correlations: across-field, spatial, and temporal. Across-field correlation refers to the statistical correlation across fields within a log or across multiple logs based on probabilistic analysis. For analyzing the spatial correlation of events, we developed a generic, easy-to-use visualizer that can display any events queried by users on a system machine graph. LogAider can also mine spatial correlations using an optimized K-means clustering algorithm over a torus network topology. It is also able to disclose the temporal correlations (or error propagations) over a certain period inside a log or across multiple logs, based on an effective similarity analysis strategy. We assessed LogAider using the one-year reliability-availability-serviceability (RAS) log of the Mira system (one of the world's most powerful supercomputers), as well as its job log. We find that LogAider is very helpful for revealing the potential correlations of fatal system events and job events, with accurate mining of across-field correlation (both precision and recall of 99.9-100%) and precise detection of temporal correlation with a high similarity (up to 95%) to the ground truth.
AB - Today's large-scale supercomputers are producing a huge amount of log data. Exploring various potential correlations of fatal events is crucial for understanding their causality and improving the working efficiency of system administrators. To this end, we developed a toolkit, named LogAider, that can reveal three types of potential correlations: across-field, spatial, and temporal. Across-field correlation refers to the statistical correlation across fields within a log or across multiple logs based on probabilistic analysis. For analyzing the spatial correlation of events, we developed a generic, easy-to-use visualizer that can display any events queried by users on a system machine graph. LogAider can also mine spatial correlations using an optimized K-means clustering algorithm over a torus network topology. It is also able to disclose the temporal correlations (or error propagations) over a certain period inside a log or across multiple logs, based on an effective similarity analysis strategy. We assessed LogAider using the one-year reliability-availability-serviceability (RAS) log of the Mira system (one of the world's most powerful supercomputers), as well as its job log. We find that LogAider is very helpful for revealing the potential correlations of fatal system events and job events, with accurate mining of across-field correlation (both precision and recall of 99.9-100%) and precise detection of temporal correlation with a high similarity (up to 95%) to the ground truth.
UR - http://www.scopus.com/inward/record.url?scp=85027449706&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85027449706&partnerID=8YFLogxK
U2 - 10.1109/CCGRID.2017.18
DO - 10.1109/CCGRID.2017.18
M3 - Conference contribution
AN - SCOPUS:85027449706
T3 - Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
SP - 442
EP - 451
BT - Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 14 May 2017 through 17 May 2017
ER -