Abstract
We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
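As a rough illustration of the kind of linear checksum condition the abstract refers to, the sketch below checks a matrix product C = AB by comparing Cw against A(Bw) for a checksum vector w. This is a minimal sketch and not the tests defined in the paper: the function names, the all-ones choice of w, and the tolerance formula (a safety factor times n times machine epsilon times the infinity norms of A and B) are assumptions made here for illustration only.

```python
import sys


def matvec(M, v):
    """Multiply matrix M (list of rows) by vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def matmul(A, B):
    """Plain triple-loop style product, standing in for the routine under test."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def inf_norm(M):
    """Matrix infinity norm: the largest absolute row sum."""
    return max(sum(abs(x) for x in row) for row in M)


def checksum_ok(A, B, C, safety=10.0):
    """Return True if C passes the checksum test for C = A @ B.

    Necessary condition (linear in C): C w == A (B w) for any vector w.
    Here w is all ones, an illustrative choice.
    """
    n = len(B)                    # inner dimension of the product
    w = [1.0] * len(B[0])         # checksum vector (assumption: all ones)
    lhs = matvec(C, w)            # checksum of the returned result
    rhs = matvec(A, matvec(B, w))  # same quantity computed from the inputs
    residual = max(abs(x - y) for x, y in zip(lhs, rhs))
    # Hypothetical tolerance separating roundoff from faults; the paper
    # derives its own tolerances, which are not reproduced here.
    eps = sys.float_info.epsilon
    tol = safety * n * eps * inf_norm(A) * inf_norm(B)
    return residual <= tol


A = [[1.0, 2.0], [3.0, 4.0]]
B = [[5.0, 6.0], [7.0, 8.0]]
C = matmul(A, B)
clean = checksum_ok(A, B, C)      # fault-free result
C[0][1] += 1e-3                   # simulate a fault in one output entry
faulty = checksum_ok(A, B, C)     # corrupted result
```

The check costs two or three matrix-vector products, which is cheap relative to the product itself. Because the condition is only necessary, a fault whose effect is orthogonal to w (for example, equal and opposite perturbations in the same row) would pass unnoticed with this choice of w.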
Original language | English (US) |
---|---|
Pages (from-to) | 579-591 |
Number of pages | 13 |
Journal | IEEE Transactions on Computers |
Volume | 52 |
Issue number | 5 |
State | Published - May 2003 |
Externally published | Yes |
Keywords
- Aerospace
- Algorithm-based fault tolerance
- Error analysis
- Parallel numerical algorithms
- Result checking
ASJC Scopus subject areas
- Software
- Theoretical Computer Science
- Hardware and Architecture
- Computational Theory and Mathematics