BlobCR: Virtual disk based checkpoint-restart for HPC applications on IaaS clouds

Bogdan Nicolae, Franck Cappello

Research output: Contribution to journal › Article › peer-review

Abstract

Infrastructure-as-a-Service (IaaS) cloud computing is gaining significant interest in industry and academia as an alternative platform for running HPC applications. Given the need to provide fault tolerance, support for suspend-resume and offline migration, an efficient Checkpoint-Restart mechanism becomes paramount in this context. We propose BlobCR, a dedicated checkpoint repository that is able to take live incremental snapshots of the whole disk attached to the virtual machine (VM) instances. BlobCR aims to minimize the performance overhead of checkpointing by persisting VM disk snapshots asynchronously in the background using a low overhead technique we call selective copy-on-write. It includes support for both application-level and process-level checkpointing, as well as support to roll back filesystem changes. Experiments at large scale demonstrate the benefits of our proposal both in synthetic settings and for a real-life HPC application.
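The idea of incremental disk snapshots with rollback, as mentioned in the abstract, can be illustrated with a minimal sketch. This is not BlobCR's implementation: the class `ChunkedDisk`, its methods, and the fixed chunk size are hypothetical names chosen for illustration, and the sketch does not model asynchronous background persistence or the selective copy-on-write technique. It only shows, under those assumptions, how tracking modified chunks lets each snapshot store just the changes since the previous one.

```python
class ChunkedDisk:
    """Toy virtual disk split into fixed-size chunks (hypothetical model)."""

    def __init__(self, num_chunks, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = [bytes(chunk_size) for _ in range(num_chunks)]
        self.dirty = set()      # chunks modified since the last snapshot
        self.snapshots = []     # each snapshot: {chunk_index: chunk_bytes}

    def write(self, index, data):
        if len(data) != self.chunk_size:
            raise ValueError("data must be exactly one chunk long")
        self.chunks[index] = bytes(data)
        self.dirty.add(index)

    def snapshot(self):
        """Record only the chunks changed since the previous snapshot."""
        delta = {i: self.chunks[i] for i in sorted(self.dirty)}
        self.snapshots.append(delta)
        self.dirty.clear()
        return len(self.snapshots) - 1

    def restore(self, snapshot_id):
        """Roll the disk back by replaying deltas 0..snapshot_id."""
        state = [bytes(self.chunk_size) for _ in self.chunks]
        for delta in self.snapshots[: snapshot_id + 1]:
            for i, chunk in delta.items():
                state[i] = chunk
        self.chunks = state
        self.snapshots = self.snapshots[: snapshot_id + 1]
        self.dirty.clear()


disk = ChunkedDisk(num_chunks=4)
disk.write(0, b"aaaa")
first = disk.snapshot()            # stores chunk 0 only
disk.write(1, b"bbbb")
second = disk.snapshot()           # stores chunk 1 only
second_size = len(disk.snapshots[second])
disk.write(0, b"zzzz")             # uncheckpointed change
disk.restore(first)                # discards chunk 1 and the "zzzz" write
```

After `restore(first)`, the disk holds the state captured by the first snapshot, so changes made after that checkpoint are rolled back. A real system would persist the deltas to a repository rather than keep them in memory.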

Original language: English (US)
Pages (from-to): 698-711
Number of pages: 14
Journal: Journal of Parallel and Distributed Computing
Volume: 73
Issue number: 5
State: Published - May 2013

Keywords

  • Checkpoint-restart
  • Fault tolerance
  • High performance computing
  • IaaS clouds
  • Rollback of filesystem changes
  • Virtual disk snapshots

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computer Networks and Communications
  • Artificial Intelligence
