Modeling coordinated checkpointing for large-scale supercomputers

Long Wang, Karthik Pattabiraman, Zbigniew Kalbarczyk, Ravishankar K. Iyer, Lawrence Votta, Christopher Vick, Alan Wood

Research output: Contribution to conference (Paper)

Abstract

Current supercomputing systems consisting of thousands of nodes cannot meet the demands of emerging high-performance scientific applications. As a result, a new generation of supercomputing systems consisting of hundreds of thousands of nodes is being proposed. However, these systems are likely to experience far more frequent failures than today's systems, and such failures must be tackled effectively. Coordinated checkpointing is a common technique to deal with failures in supercomputers. This paper presents a model of a coordinated checkpointing protocol for large-scale supercomputers, and studies its scalability by considering both the coordination overhead and the effect of failures. Unlike most of the existing checkpointing models, the proposed model takes into account failures during checkpointing and recovery, as well as correlated failures. Stochastic Activity Networks (SANs) are used to model the system, and the model is simulated to study the scalability, reliability, and performance of the system.

Original language: English (US)
Pages: 812-821
Number of pages: 10
State: Published - Nov 9 2005
Event: 2005 International Conference on Dependable Systems and Networks - Yokohama, Japan
Duration: Jun 28 2005 to Jul 1 2005

Other

Other: 2005 International Conference on Dependable Systems and Networks
Country: Japan
City: Yokohama
Period: 6/28/05 to 7/1/05

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Wang, L., Pattabiraman, K., Kalbarczyk, Z., Iyer, R. K., Votta, L., Vick, C., & Wood, A. (2005). Modeling coordinated checkpointing for large-scale supercomputers. 812-821. Paper presented at 2005 International Conference on Dependable Systems and Networks, Yokohama, Japan.