TY - GEN
T1 - Compiler optimizations for fault tolerance software checking
AU - Yu, Jing
AU - Garzarán, María Jesús
PY - 2007
Y1 - 2007
N2 - Dramatic increases in the number of transistors that can be integrated on a chip will make the hardware more susceptible to radiation-induced transient errors. High-end architectures like the IBM mainframes, HP NonStop or mission-critical computers are likely to include several hardware-intensive fault tolerance techniques. However, the commodity chips which are cost- and energy-constrained, will need a more flexible and inexpensive technology for error detection. Software approaches can play a major role for this sector of the market because they need little hardware modification and can be tailored to fit different requirements of reliability and performance. Current software approaches address the problem by replicating the instructions and adding checking instructions to compare the results [1, 2, 3, 4, 5]. These checking instructions account for a significant fraction of the added overhead. In this work we propose a set of compiler optimizations to identify and remove redundant checks from the replicated code. Two checks are considered redundant if they check the same variable. In this case, it is possible to remove the check that appears first during execution so that an error will be detected when the second check executes. However, determining how much a check can be delayed is tricky. If we delay it too little, there is little room for optimization. If we delay it too much, the errors will propagate to undesired places and result in segmentation faults, corrupted memory, wrong execution path, or undetected errors across checkpoints. We consider that how much the error detection can be delayed will depend on the recovery mechanism supported by the hardware or the system. As long as checks are not delayed beyond synchronization checkpoints, the system will be able to properly recover. With our techniques the user can define what are the synchronization checkpoints based on the hardware support for recovery. In this work we evaluate two levels of hardware or system support: memory without support for checkpointing and rollback, where memory is guaranteed to not be corrupted with wrong values and memory with low-cost support for checkpointing and rollback. We also consider the situation where register file is protected with parity or ECC, such as Intel Itanium, Sun UltraSPARC and IBM Power4-6 because software implementations can take advantage of this hardware feature and reduce some of the replicated instructions. We have evaluated our approach using LLVM as our compiler infrastructure and PIN for fault injection. Our experimental results with Spec benchmarks on a Pentium 4 show that in the case where memory is guaranteed not to be corrupted, performance improves by an average 6.2%. With more support for checkpoint performance improves by an average 14.7%. A software fault tolerant system that takes advantage of the register safe platforms improves by an average 16.0%. Fault injection experiments show that our techniques do not decrease fault coverage, although they slightly increase the number of segmentation faults.
AB - Dramatic increases in the number of transistors that can be integrated on a chip will make the hardware more susceptible to radiation-induced transient errors. High-end architectures like the IBM mainframes, HP NonStop or mission-critical computers are likely to include several hardware-intensive fault tolerance techniques. However, the commodity chips which are cost- and energy-constrained, will need a more flexible and inexpensive technology for error detection. Software approaches can play a major role for this sector of the market because they need little hardware modification and can be tailored to fit different requirements of reliability and performance. Current software approaches address the problem by replicating the instructions and adding checking instructions to compare the results [1, 2, 3, 4, 5]. These checking instructions account for a significant fraction of the added overhead. In this work we propose a set of compiler optimizations to identify and remove redundant checks from the replicated code. Two checks are considered redundant if they check the same variable. In this case, it is possible to remove the check that appears first during execution so that an error will be detected when the second check executes. However, determining how much a check can be delayed is tricky. If we delay it too little, there is little room for optimization. If we delay it too much, the errors will propagate to undesired places and result in segmentation faults, corrupted memory, wrong execution path, or undetected errors across checkpoints. We consider that how much the error detection can be delayed will depend on the recovery mechanism supported by the hardware or the system. As long as checks are not delayed beyond synchronization checkpoints, the system will be able to properly recover. With our techniques the user can define what are the synchronization checkpoints based on the hardware support for recovery. In this work we evaluate two levels of hardware or system support: memory without support for checkpointing and rollback, where memory is guaranteed to not be corrupted with wrong values and memory with low-cost support for checkpointing and rollback. We also consider the situation where register file is protected with parity or ECC, such as Intel Itanium, Sun UltraSPARC and IBM Power4-6 because software implementations can take advantage of this hardware feature and reduce some of the replicated instructions. We have evaluated our approach using LLVM as our compiler infrastructure and PIN for fault injection. Our experimental results with Spec benchmarks on a Pentium 4 show that in the case where memory is guaranteed not to be corrupted, performance improves by an average 6.2%. With more support for checkpoint performance improves by an average 14.7%. A software fault tolerant system that takes advantage of the register safe platforms improves by an average 16.0%. Fault injection experiments show that our techniques do not decrease fault coverage, although they slightly increase the number of segmentation faults.
UR - http://www.scopus.com/inward/record.url?scp=47849096384&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=47849096384&partnerID=8YFLogxK
U2 - 10.1109/PACT.2007.4336261
DO - 10.1109/PACT.2007.4336261
M3 - Conference contribution
AN - SCOPUS:47849096384
SN - 0769529445
SN - 9780769529448
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 433
BT - 16th International Conference on Parallel Architecture and Compilation Techniques, PACT 2007
T2 - 16th International Conference on Parallel Architecture and Compilation Techniques, PACT 2007
Y2 - 15 September 2007 through 19 September 2007
ER -