TY - GEN
T1 - Modeling and tolerating heterogeneous failures in large parallel systems
AU - Heien, Eric
AU - Kondo, Derrick
AU - Gainaru, Ana
AU - Lapine, Dan
AU - Kramer, Bill
AU - Cappello, Franck
PY - 2011
Y1 - 2011
N2 - As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time. To this end, we study 5 years of system logs from a production high-performance computing system and model hardware failures involving processors, memory, storage and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates compared to general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.
AB - As supercomputers and clusters increase in size and complexity, system failures are inevitable. Different hardware components (such as memory, disk, or network) of such systems can have different failure rates. Prior works assume failures equally affect an application, whereas our goal is to provide failure models for applications that reflect their specific component usage. This is challenging because component failure dynamics are heterogeneous in space and time. To this end, we study 5 years of system logs from a production high-performance computing system and model hardware failures involving processors, memory, storage and network components. We model each component and construct integrated failure models given the component usage of common supercomputing applications. We show that these application-centric models provide more accurate reliability estimates compared to general models, which improves the efficacy of fault-tolerant algorithms. In particular, we demonstrate how applications can tune their checkpointing strategies to the tailored model.
UR - http://www.scopus.com/inward/record.url?scp=83155160934&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83155160934&partnerID=8YFLogxK
U2 - 10.1145/2063384.2063444
DO - 10.1145/2063384.2063444
M3 - Conference contribution
AN - SCOPUS:83155160934
SN - 9781450307710
T3 - Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
BT - Proceedings of 2011 SC - International Conference for High Performance Computing, Networking, Storage and Analysis
T2 - 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC11
Y2 - 12 November 2011 through 18 November 2011
ER -