Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

Ana Gainaru, Franck Cappello, Joshi Fullop, Stefan Trausan-Matu, William T Kramer

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research restricts correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The generated events change constantly during the lifetime of the machine. We therefore consider it important to update the sequences at runtime by applying modifications after each prediction phase according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysis system is able to predict around 60% of events with a precision of around 85% at a lower event granularity than before.

Original language: English (US)
Title of host publication: Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11
DOIs: https://doi.org/10.1145/2038633.2038637
State: Published - Nov 17 2011
Event: Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11 - Cascais, Portugal
Duration: Oct 23 2011 to Oct 26 2011

Publication series

Name: Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11

Other

Other: Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11
Country: Portugal
City: Cascais
Period: 10/23/11 to 10/26/11

Fingerprint

Large scale systems
Time delay
Experiments

Keywords

  • Event prediction
  • HPC systems
  • Logfile analysis

ASJC Scopus subject areas

  • Computer Science Applications
  • Information Systems
  • Software

Cite this

Gainaru, A., Cappello, F., Fullop, J., Trausan-Matu, S., & Kramer, W. T. (2011). Adaptive event prediction strategy with dynamic time window for large-scale HPC systems. In Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11 [4] (Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11). https://doi.org/10.1145/2038633.2038637

@inproceedings{59cdba97649b43abb3c5d41c52887ff6,
title = "Adaptive event prediction strategy with dynamic time window for large-scale HPC systems",
abstract = "In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research restricts correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The generated events change constantly during the lifetime of the machine. We therefore consider it important to update the sequences at runtime by applying modifications after each prediction phase according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysis system is able to predict around 60{\%} of events with a precision of around 85{\%} at a lower event granularity than before.",
keywords = "Event prediction, HPC systems, Logfile analysis",
author = "Ana Gainaru and Franck Cappello and Joshi Fullop and Stefan Trausan-Matu and Kramer, {William T}",
year = "2011",
month = "11",
day = "17",
doi = "10.1145/2038633.2038637",
language = "English (US)",
isbn = "9781450309783",
series = "Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11",
booktitle = "Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11",

}

TY - GEN

T1 - Adaptive event prediction strategy with dynamic time window for large-scale HPC systems

AU - Gainaru, Ana

AU - Cappello, Franck

AU - Fullop, Joshi

AU - Trausan-Matu, Stefan

AU - Kramer, William T

PY - 2011/11/17

Y1 - 2011/11/17

N2 - In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research restricts correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The generated events change constantly during the lifetime of the machine. We therefore consider it important to update the sequences at runtime by applying modifications after each prediction phase according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysis system is able to predict around 60% of events with a precision of around 85% at a lower event granularity than before.

AB - In this paper, we analyse messages generated by different large-scale HPC systems in order to extract sequences of correlated events, which we later use to predict the normal and faulty behaviour of the system. Our method uses a dynamic window strategy that is able to find frequent sequences of events regardless of the time delay between them. Most current related research restricts correlation extraction to fixed and relatively small time windows that do not reflect the whole behaviour of the system. The generated events change constantly during the lifetime of the machine. We therefore consider it important to update the sequences at runtime by applying modifications after each prediction phase according to the forecast's accuracy and the difference between what was expected and what really happened. Our experiments show that our analysis system is able to predict around 60% of events with a precision of around 85% at a lower event granularity than before.

KW - Event prediction

KW - HPC systems

KW - Logfile analysis

UR - http://www.scopus.com/inward/record.url?scp=81055139569&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=81055139569&partnerID=8YFLogxK

U2 - 10.1145/2038633.2038637

DO - 10.1145/2038633.2038637

M3 - Conference contribution

AN - SCOPUS:81055139569

SN - 9781450309783

T3 - Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11

BT - Managing Large-Scale Systems via the Analysis of System Logs and the Application of Machine Learning Techniques, SLAML'11

ER -