VirtCFT: A transparent VM-level fault-tolerant system for virtual clusters

Minjia Zhang, Hai Jin, Xuanhua Shi, Song Wu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

A virtual cluster consists of many virtual machines and software components, any of which will eventually fail. In many environments, such failures can cause unanticipated, potentially devastating behavior and service unavailability. Failover capability is therefore essential to a virtual cluster's availability, reliability, and manageability. Most existing methods share two disadvantages: they require modifications to the target processes or their operating systems, which is error prone and sometimes impractical; and they checkpoint only processes rather than whole OS images, which limits where they can be applied. In this paper we present VirtCFT, a practical fault-tolerance system for virtual clusters. VirtCFT is a system-level, coordinated, distributed checkpointing system. It coordinates the distributed VMs to periodically reach a globally consistent state and checkpoints the whole virtual cluster, including the CPU, memory, and disk state of each VM as well as the network communications. When a fault occurs, VirtCFT automatically recovers the entire virtual cluster to a correct state within a few seconds and keeps it running. Compared with existing fault-tolerance mechanisms, VirtCFT provides a simpler and fully transparent platform that protects existing, unmodified software and operating systems, regardless of version, from the failure of the physical machine on which they run. We have implemented the system on the Xen virtualization platform. Our experiments with real-world benchmarks demonstrate the effectiveness and correctness of VirtCFT.
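
The coordinated checkpoint-and-recover cycle described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' implementation: `SimVM` and `Coordinator` are hypothetical names, the VMs are plain in-memory objects rather than Xen guests, and the "globally consistent state" is approximated by pausing every VM before any snapshot is taken, so that undelivered messages are captured together with CPU, memory, and disk state.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class SimVM:
    """Hypothetical in-memory stand-in for a VM (not a Xen API)."""
    name: str
    state: dict = field(
        default_factory=lambda: {"cpu": 0, "memory": {}, "disk": {}}
    )
    inbox: list = field(default_factory=list)  # delivered, not yet consumed
    paused: bool = False

    def step(self):
        """Simulate one unit of work; a paused VM must not make progress."""
        if self.paused:
            raise RuntimeError(f"{self.name} is paused")
        self.state["cpu"] += 1


class Coordinator:
    """Hypothetical coordinator for a cluster-wide checkpoint."""

    def __init__(self, vms):
        self.vms = {vm.name: vm for vm in vms}
        self.committed = None  # last complete cluster-wide checkpoint

    def checkpoint(self):
        # Phase 1: pause every VM so no state or message changes mid-snapshot.
        for vm in self.vms.values():
            vm.paused = True
        try:
            # Phase 2: copy each VM's state and its pending messages.
            pending = {
                name: (copy.deepcopy(vm.state), list(vm.inbox))
                for name, vm in self.vms.items()
            }
            # Commit only after every VM has been captured, so a partial
            # checkpoint never replaces the previous complete one.
            self.committed = pending
        finally:
            # Resume the cluster whether or not the snapshot succeeded.
            for vm in self.vms.values():
                vm.paused = False

    def recover(self):
        """Roll every VM back to the last committed checkpoint."""
        if self.committed is None:
            raise RuntimeError("no committed checkpoint")
        for name, (state, inbox) in self.committed.items():
            vm = self.vms[name]
            vm.state = copy.deepcopy(state)
            vm.inbox = list(inbox)
            vm.paused = False


if __name__ == "__main__":
    a, b = SimVM("vm-a"), SimVM("vm-b")
    cluster = Coordinator([a, b])
    a.step()
    cluster.checkpoint()
    a.step()           # work done after the checkpoint
    cluster.recover()  # the whole cluster rolls back together
    print(a.state["cpu"])
```

Rolling back all VMs together, rather than only the failed one, is what keeps the restored cluster consistent in this sketch: no VM retains state that depends on work another VM has "forgotten".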

Original language: English (US)
Title of host publication: Proceedings - 16th International Conference on Parallel and Distributed Systems, ICPADS 2010
Pages: 147-154
Number of pages: 8
State: Published - 2010
Externally published: Yes
Event: 16th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2010 - Shanghai, China
Duration: Dec 8 2010 to Dec 10 2010

Publication series

Name: Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
ISSN (Print): 1521-9097

Other

Other: 16th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2010
Country/Territory: China
City: Shanghai
Period: 12/8/10 to 12/10/10

Keywords

  • Coordinated checkpointing
  • Fault tolerance
  • High availability
  • Virtual machine

ASJC Scopus subject areas

  • Hardware and Architecture
