TY - GEN
T1 - Improving Log-based Field Failure Data Analysis of multi-node computing systems
AU - Pecchia, Antonio
AU - Cotroneo, Domenico
AU - Kalbarczyk, Zbigniew
AU - Iyer, Ravishankar K.
N1 - Copyright 2011 Elsevier B.V., All rights reserved.
PY - 2011
Y1 - 2011
N2 - Log-based Field Failure Data Analysis (FFDA) is a widely adopted methodology to assess dependability properties of an operational system. A key step in FFDA is filtering out, from the log, entries that are not useful as well as redundant error entries. The latter is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtain realistic measurements. This paper deals with the issues of the tuple heuristic, used to group the error entries in the log, in multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly; thus, an improved heuristic that adopts statistical indicators is proposed. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with both heuristics. The analysis encompasses the log of the Mercury cluster at the National Center for Supercomputing Applications.
AB - Log-based Field Failure Data Analysis (FFDA) is a widely adopted methodology to assess dependability properties of an operational system. A key step in FFDA is filtering out, from the log, entries that are not useful as well as redundant error entries. The latter is challenging: a fault, once triggered, can generate multiple errors that propagate within the system. Grouping the error entries related to the same fault manifestation is crucial to obtain realistic measurements. This paper deals with the issues of the tuple heuristic, used to group the error entries in the log, in multi-node computing systems. We demonstrate that the tuple heuristic can group entries incorrectly; thus, an improved heuristic that adopts statistical indicators is proposed. We assess the impact of inaccurate grouping on dependability measurements by comparing the results obtained with both heuristics. The analysis encompasses the log of the Mercury cluster at the National Center for Supercomputing Applications.
KW - Field Failure Data Analysis
KW - collision
KW - dependability measurements
KW - supercomputer
KW - tuple heuristic
UR - http://www.scopus.com/inward/record.url?scp=80051915968&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80051915968&partnerID=8YFLogxK
U2 - 10.1109/DSN.2011.5958210
DO - 10.1109/DSN.2011.5958210
M3 - Conference contribution
AN - SCOPUS:80051915968
SN - 9781424492336
T3 - Proceedings of the International Conference on Dependable Systems and Networks
SP - 97
EP - 108
BT - 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks, DSN 2011
T2 - 2011 IEEE/IFIP 41st International Conference on Dependable Systems and Networks, DSN 2011
Y2 - 27 June 2011 through 30 June 2011
ER -