ReVive: Cost-effective architectural support for rollback recovery in shared-memory multiprocessors

Milos Prvulovic, Zheng Zhang, Josep Torrellas

Research output: Contribution to journal, Article, peer-review

Abstract

This paper presents ReVive, a novel general-purpose rollback recovery mechanism for shared-memory multiprocessors. ReVive carefully balances the conflicting requirements of availability, performance, and hardware cost. ReVive performs checkpointing, logging, and distributed parity protection, all memory-based. It enables recovery from a wide class of errors, including the permanent loss of an entire node. To maintain high performance, ReVive includes specialized hardware that performs frequent operations in the background, such as log and parity updates. To keep the cost low, more complex checkpointing and recovery functions are performed in software, while the hardware modifications are limited to the directory controllers of the machine. Our simulation results on a 16-processor system indicate that the average error-free execution time overhead of using ReVive is only 6.3%, while the achieved availability is better than 99.999% even when errors occur as often as once per day.
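The interplay of logging and parity described in the abstract can be illustrated with a toy software model. This is only a sketch under stated assumptions, not the hardware design from the paper: the class `ToyMemory`, its methods, the single parity line, and the one-line-per-node layout are all hypothetical, and the log is kept in an ordinary Python list rather than in parity-protected memory. It shows the general idea that saving the old value before a write allows rollback to the last checkpoint, and that XOR parity allows the contents of one lost node to be rebuilt.

```python
from functools import reduce


def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


class ToyMemory:
    """Hypothetical model: one memory line per node plus one parity line."""

    def __init__(self, num_nodes: int, line_size: int = 4):
        self.data = [bytes(line_size) for _ in range(num_nodes)]
        self.parity = bytes(line_size)  # XOR of all data lines
        self.log = []  # (node, old_value) pairs since the last checkpoint

    def write(self, node: int, new: bytes) -> None:
        old = self.data[node]
        # Save the old value first so the write can be undone later.
        self.log.append((node, old))
        # Incremental parity update: parity ^= old ^ new.
        self.parity = xor_bytes(self.parity, xor_bytes(old, new))
        self.data[node] = new

    def checkpoint(self) -> None:
        # Current memory becomes the recovery point; older log entries
        # are no longer needed.
        self.log.clear()

    def rollback(self) -> None:
        # Undo writes in reverse order, keeping parity consistent.
        while self.log:
            node, old = self.log.pop()
            current = self.data[node]
            self.parity = xor_bytes(self.parity, xor_bytes(current, old))
            self.data[node] = old

    def reconstruct(self, lost: int) -> bytes:
        # Rebuild a lost node's line from parity and the surviving lines.
        survivors = [d for i, d in enumerate(self.data) if i != lost]
        return reduce(xor_bytes, survivors, self.parity)
```

In this model, calling `reconstruct(i)` after any sequence of writes returns the current contents of node `i`, and calling `rollback()` restores every line to its value at the most recent `checkpoint()`. A real system would also have to protect the log itself and coordinate checkpoints across processors, which this sketch leaves out.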

Original language: English (US)
Pages (from-to): 111-122
Number of pages: 12
Journal: Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA
State: Published - Jan 1 2002

ASJC Scopus subject areas

  • Hardware and Architecture
