TY - GEN
T1 - Taming of the shrew
T2 - 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
AU - Gainaru, Ana
AU - Cappello, Franck
AU - Kramer, William
PY - 2012
Y1 - 2012
N2 - HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general consistent rule and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and faulty situation relies on event analysis. Being able to detect quickly deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows become more challenging and with the upcoming 10 Pet flop systems, there is a lot of interestin this topic. Current event mining approaches do not take into consideration the specific behaviour of each type of events and as a consequence, fail to analyze them according to their characteristics. In this paper we propose a novel way of characterizing the normal and faulty behaviour of the system by using signal analysis concepts. All analysis modules create ELSA (Event Log Signal Analyzer), a toolkit that has the purpose of modelling the normal flow of each state event during a HPC system lifetime, and how it is affected when a failure hits the system. We show that these extracted models provide an accurate view of the system output, which improves the effectiveness of proactive fault tolerance algorithms. Specifically, we implemented a filtering algorithm and short-term fault prediction methodology based on the extracted model and test it against real failure traces from a large-scale system. We show that by analyzing each event according to its specific behaviour, we get a more realistic overview of the entire system.
AB - HPC systems are complex machines that generate a huge volume of system state data called "events". Events are generated without following a general consistent rule and different hardware and software components of such systems have different failure rates. Distinguishing between normal system behaviour and faulty situation relies on event analysis. Being able to detect quickly deviations from normality is essential for system administration and is the foundation of fault prediction. As HPC systems continue to grow in size and complexity, mining event flows become more challenging and with the upcoming 10 Pet flop systems, there is a lot of interestin this topic. Current event mining approaches do not take into consideration the specific behaviour of each type of events and as a consequence, fail to analyze them according to their characteristics. In this paper we propose a novel way of characterizing the normal and faulty behaviour of the system by using signal analysis concepts. All analysis modules create ELSA (Event Log Signal Analyzer), a toolkit that has the purpose of modelling the normal flow of each state event during a HPC system lifetime, and how it is affected when a failure hits the system. We show that these extracted models provide an accurate view of the system output, which improves the effectiveness of proactive fault tolerance algorithms. Specifically, we implemented a filtering algorithm and short-term fault prediction methodology based on the extracted model and test it against real failure traces from a large-scale system. We show that by analyzing each event according to its specific behaviour, we get a more realistic overview of the entire system.
KW - fault detection
KW - fault tolerance
KW - large-scale HPC systems
KW - signal analysis
UR - http://www.scopus.com/inward/record.url?scp=84866885057&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866885057&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2012.107
DO - 10.1109/IPDPS.2012.107
M3 - Conference contribution
AN - SCOPUS:84866885057
SN - 9780769546759
T3 - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
SP - 1168
EP - 1179
BT - Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS 2012
Y2 - 21 May 2012 through 25 May 2012
ER -