Abstract
Proposed here is a novel architecture for a fault-tolerant multiprocessor environment. It is assumed that the multiprocessor organization consists of a pool of active processing modules and either a small number of spare modules or active modules with some spare processing capacity. A fault-tolerance scheme is developed for duplex systems using checkpoints. Our scheme, unlike traditional checkpointing schemes, requires no rollbacks for recovering from single faults. The objective here is to achieve the performance of a Triple Modular Redundant system using duplex system redundancy. In the proposed scheme, at each checkpoint, the states of the two modules executing the task are compared for detection of faults. If a disagreement occurs, indicating a fault, the two differing states are both stored. Instead of performing the usual rollback and retry, the following mechanism is used. The state at the preceding checkpoint, where both processing modules had agreed, is loaded into a spare module. The checkpoint interval in which the failure is detected is then “retried” on the spare module. Concurrently, the task continues forward on the two active modules, beyond the checkpoint where the disagreement occurred. At the next checkpoint, the state of the spare is compared with the stored states of the two active modules (the stored states correspond to the checkpoint where the disagreement occurred). The active module that disagrees with the spare is identified as faulty. Once the faulty module is identified, its state is restored to the correct state by copying the state from the other active module, which is fault-free. The spare is released to the pool after recovery is completed. It is important to note that the spare is shared among many processor pairs and is used only temporarily when faults occur. Since the above mechanism achieves forward recovery, the proposed scheme is termed the Roll-Forward Checkpointing Scheme (RFCS).
RFCS allows recovery from single failures without the overhead of rollback. The advantage of the proposed scheme is that it achieves a lower average execution time with a lower variance than the rollback scheme. This can be crucial for real-time systems, since lower variance enhances the predictability of the task completion time.
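The recovery mechanism described in the abstract can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the authors' implementation: the module state is a single integer, `run_interval` is a hypothetical deterministic checkpoint interval, a transient fault is modeled as one flipped bit, the spare's concurrent retry is simulated sequentially, and at most one fault occurs across any two consecutive intervals. The names `run_interval`, `identify_faulty`, and `run_rfcs` are invented for this illustration.

```python
from typing import List, Optional, Set, Tuple


def run_interval(state: int, fault: bool = False) -> int:
    """Hypothetical checkpoint interval: a deterministic update of an integer state."""
    new_state = state * 31 + 7
    # Assumption: an injected transient fault flips the lowest bit of the result.
    return new_state ^ 1 if fault else new_state


def identify_faulty(start: int, stored_a: int, stored_b: int) -> str:
    """Retry the interval on the spare and compare with the two stored states."""
    spare = run_interval(start)  # the spare is assumed to be fault-free
    if spare == stored_a:
        return "B"
    if spare == stored_b:
        return "A"
    raise RuntimeError("spare agrees with neither module; single-fault assumption violated")


def run_rfcs(initial: int, n_intervals: int,
             faults: Set[Tuple[int, str]]) -> Tuple[int, List[Tuple[int, str]]]:
    """Run a duplex task with roll-forward recovery.

    `faults` holds (interval_index, module_name) pairs at which a transient
    fault is injected. Returns the final state and the list of
    (interval_index, faulty_module) diagnoses.
    """
    a = b = agreed = initial
    # Pending diagnosis: (interval index, last agreed state, stored A, stored B).
    pending: Optional[Tuple[int, int, int, int]] = None
    diagnosed: List[Tuple[int, str]] = []

    def resolve() -> None:
        nonlocal a, b, agreed, pending
        bad_interval, start, stored_a, stored_b = pending
        faulty = identify_faulty(start, stored_a, stored_b)
        # Copy the fault-free module's current state into the faulty module.
        if faulty == "A":
            a = b
        else:
            b = a
        diagnosed.append((bad_interval, faulty))
        agreed, pending = a, None

    for i in range(n_intervals):
        # Both active modules always move forward; there is no rollback.
        a = run_interval(a, (i, "A") in faults)
        b = run_interval(b, (i, "B") in faults)
        if pending is not None:
            resolve()  # the spare has finished its retry by this checkpoint
        elif a == b:
            agreed = a
        else:
            pending = (i, agreed, a, b)  # store both differing states

    if pending is not None:
        resolve()  # disagreement at the final checkpoint: wait for the spare
    return a, diagnosed
```

For example, `run_rfcs(5, 4, {(1, "A")})` injects a fault into module A during the second interval. The task still finishes in the same state as a fault-free run, and the diagnosis list records that module A was faulty in interval 1, without either active module re-executing an interval.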
Original language | English (US) |
---|---|
Pages (from-to) | 1163-1174 |
Number of pages | 12 |
Journal | IEEE Transactions on Computers |
Volume | 43 |
Issue number | 10 |
State | Published - Oct 1994 |
Externally published | Yes |
Keywords
- Checkpointing
- duplex systems
- forward recovery
- nondedicated spares
- transient faults
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics