LOGAIDER: A Tool for Mining Potential Correlations of HPC Log Events

Sheng Di, Rinku Gupta, Marc Snir, Eric Pershey, Franck Cappello

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Today's large-scale supercomputers produce a huge amount of log data. Exploring the potential correlations of fatal events is crucial for understanding their causality and for improving the working efficiency of system administrators. To this end, we developed a toolkit, named LogAider, that can reveal three types of potential correlations: across-field, spatial, and temporal. Across-field correlation refers to the statistical correlation across fields within a log or across multiple logs, based on probabilistic analysis. For analyzing the spatial correlation of events, we developed a generic, easy-to-use visualizer that can display any events queried by users on a system machine graph. LogAider can also mine spatial correlations with an optimized K-means clustering algorithm over a torus network topology. It can also disclose temporal correlations (or error propagations) over a certain period inside a log or across multiple logs, based on an effective similarity analysis strategy. We assessed LogAider using the one-year reliability-availability-serviceability (RAS) log of the Mira system (one of the world's most powerful supercomputers), as well as its job log. We find that LogAider is very helpful for revealing the potential correlations of fatal system events and job events: it mines across-field correlations accurately, with both precision and recall of 99.9-100%, and detects temporal correlations precisely, with a high similarity (up to 95%) to the ground truth.

Original language: English (US)
Title of host publication: Proceedings - 2017 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 442-451
Number of pages: 10
ISBN (Electronic): 9781509066100
State: Published - Jul 10 2017
Event: 17th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, CCGRID 2017 - Madrid, Spain
Duration: May 14 2017 → May 17 2017

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
