TY - GEN
T1 - VirtCFT
T2 - 16th IEEE International Conference on Parallel and Distributed Systems, ICPADS 2010
AU - Zhang, Minjia
AU - Jin, Hai
AU - Shi, Xuanhua
AU - Wu, Song
PY - 2010
Y1 - 2010
AB - A virtual cluster consists of many virtual machines and software components, any of which will eventually fail. In many environments, such failures can cause unanticipated, potentially devastating behavior and service unavailability. Failover capability is therefore essential to a virtual cluster's availability, reliability, and manageability. Most existing methods share two disadvantages: they require modifications to the target processes or their operating systems, which is error prone and sometimes impractical; and they checkpoint only processes rather than entire OS images, which limits their applicability. In this paper we present VirtCFT, an innovative and practical fault tolerance system for virtual clusters. VirtCFT is a system-level, coordinated, distributed checkpointing fault-tolerant system. It coordinates the distributed VMs to periodically reach a globally consistent state and takes a checkpoint of the whole virtual cluster, including the CPU, memory, and disk state of each VM as well as the network communications. When a fault occurs, VirtCFT automatically recovers the entire virtual cluster to a correct state within a few seconds and keeps it running. Compared with existing fault tolerance mechanisms, VirtCFT provides a simpler and fully transparent fault-tolerant platform that protects existing, unmodified software and operating systems (regardless of version) from the failure of the physical machine on which they run. We have implemented the system on the Xen virtualization platform. Our experiments with real-world benchmarks demonstrate the effectiveness and correctness of VirtCFT.
KW - Coordinated checkpointing
KW - Fault tolerance
KW - High availability
KW - Virtual machine
UR - http://www.scopus.com/inward/record.url?scp=79951738630&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951738630&partnerID=8YFLogxK
U2 - 10.1109/ICPADS.2010.125
DO - 10.1109/ICPADS.2010.125
M3 - Conference contribution
AN - SCOPUS:79951738630
SN - 9780769543079
T3 - Proceedings of the International Conference on Parallel and Distributed Systems - ICPADS
SP - 147
EP - 154
BT - Proceedings - 16th International Conference on Parallel and Distributed Systems, ICPADS 2010
Y2 - 8 December 2010 through 10 December 2010
ER -