Compiler optimizations for fault tolerance software checking

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Dramatic increases in the number of transistors that can be integrated on a chip will make hardware more susceptible to radiation-induced transient errors. High-end architectures such as IBM mainframes, HP NonStop, or mission-critical computers are likely to include several hardware-intensive fault tolerance techniques. However, commodity chips, which are cost- and energy-constrained, will need a more flexible and inexpensive technology for error detection. Software approaches can play a major role in this sector of the market because they need little hardware modification and can be tailored to fit different reliability and performance requirements. Current software approaches address the problem by replicating instructions and adding checking instructions to compare the results [1, 2, 3, 4, 5]. These checking instructions account for a significant fraction of the added overhead. In this work we propose a set of compiler optimizations to identify and remove redundant checks from the replicated code. Two checks are considered redundant if they check the same variable; in this case, the check that appears first during execution can be removed, so that an error will still be detected when the second check executes. However, determining how much a check can be delayed is tricky. If we delay it too little, there is little room for optimization; if we delay it too much, errors will propagate to undesired places and result in segmentation faults, corrupted memory, wrong execution paths, or errors that go undetected across checkpoints. How much error detection can be delayed depends on the recovery mechanism supported by the hardware or the system: as long as checks are not delayed beyond synchronization checkpoints, the system will be able to recover properly. With our techniques the user can define the synchronization checkpoints based on the hardware support for recovery. In this work we evaluate two levels of hardware or system support: memory without support for checkpointing and rollback, where memory is guaranteed not to be corrupted with wrong values; and memory with low-cost support for checkpointing and rollback. We also consider the situation where the register file is protected with parity or ECC, as in the Intel Itanium, Sun UltraSPARC, and IBM Power4-6, because software implementations can take advantage of this hardware feature to reduce some of the replicated instructions. We have evaluated our approach using LLVM as our compiler infrastructure and PIN for fault injection. Our experimental results with SPEC benchmarks on a Pentium 4 show that, in the case where memory is guaranteed not to be corrupted, performance improves by an average of 6.2%. With additional support for checkpointing, performance improves by an average of 14.7%. A software fault-tolerant system that takes advantage of register-safe platforms improves by an average of 16.0%. Fault injection experiments show that our techniques do not decrease fault coverage, although they slightly increase the number of segmentation faults.
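
To make the idea of redundant-check elimination concrete, the sketch below is our own illustration, not code from the paper: each computation is replicated into a shadow copy and comparison checks guard the values, in the style of the software duplication schemes the abstract cites. Because two checks guard the same value chain before it reaches memory or a checkpoint, the earlier check can be removed and the later one still catches the error in time. The names scale_and_add and fault_detected are hypothetical.

    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative sketch of software instruction duplication with checking.
     * Each original computation has a replicated "shadow" copy, and a check
     * compares the two values before the result is used. */

    static void fault_detected(const char *where) {
        fprintf(stderr, "transient fault detected at %s\n", where);
        exit(1);   /* a real system would instead trigger recovery/rollback */
    }

    int scale_and_add(int a, int b) {
        int t  = a * 2;    /* original computation */
        int t2 = a * 2;    /* replicated (shadow) computation */

        /* Check 1: compares t against its shadow copy. Under the optimization
         * sketched in the abstract, this check is redundant: the same value
         * chain is checked again below (Check 2) before anything escapes to
         * memory or crosses a checkpoint, so Check 1 can be removed. */
        if (t != t2) fault_detected("check 1 (removable)");

        int r  = t + b;
        int r2 = t2 + b;

        /* Check 2: the surviving check; an error in t or r is still caught
         * here, before the value is stored or passes a checkpoint. */
        if (r != r2) fault_detected("check 2");

        return r;
    }

    int main(void) {
        printf("%d\n", scale_and_add(3, 4));   /* prints 10 */
        return 0;
    }
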

Original language: English (US)
Title of host publication: 16th International Conference on Parallel Architecture and Compilation Techniques, PACT 2007
Number of pages: 1
State: Published - 2007
Event: 16th International Conference on Parallel Architecture and Compilation Techniques, PACT 2007 - Brasov, Romania
Duration: Sep 15 2007 - Sep 19 2007

Publication series

Name: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
ISSN (Print): 1089-795X


Conference: 16th International Conference on Parallel Architecture and Compilation Techniques, PACT 2007

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
