Tests and tolerances for high-performance software-implemented fault detection

Michael Turmon, Robert Granat, Daniel S. Katz, John Z. Lou

Research output: Contribution to journal › Article › peer-review

Abstract

We describe and test a software approach to fault detection in common numerical algorithms. Such result checking or algorithm-based fault tolerance (ABFT) methods may be used, for example, to overcome single-event upsets in computational hardware or to detect errors in complex, high-efficiency implementations of the algorithms. Following earlier work, we use checksum methods to validate results returned by a numerical subroutine operating subject to unpredictable errors in data. We consider common matrix and Fourier algorithms which return results satisfying a necessary condition having a linear form; the checksum tests compliance with this condition. We discuss the theory and practice of setting numerical tolerances to separate errors caused by a fault from those inherent in finite-precision floating-point calculations. We concentrate on comprehensively defining and evaluating tests having various accuracy/computational burden tradeoffs, and we emphasize average-case algorithm behavior rather than using worst-case upper bounds on error.
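As an illustration of the checksum idea in the abstract, here is a minimal sketch of a result check for a matrix product P = AB. The necessary linear condition is Pw = A(Bw) for a checksum vector w. The helper names, the all-ones choice of w, and the tolerance formula (a norm-scaled multiple of machine epsilon with a hypothetical `scale` factor) are assumptions made for this example; they are not the tests or tolerances defined in the article.

```python
import sys

EPS = sys.float_info.epsilon  # double-precision machine epsilon


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * x for a, x in zip(row, v)) for row in m]


def matmul(a, b):
    """Plain triple-loop style product, standing in for the routine under test."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def inf_norm(m):
    """Matrix infinity norm: maximum absolute row sum."""
    return max(sum(abs(x) for x in row) for row in m)


def checksum_ok(a, b, p, scale=10.0):
    """Return True if p passes the linear checksum test for p = a @ b.

    Compares p*w with a*(b*w), which costs O(n^2) rather than the O(n^3)
    of recomputing the product. The tolerance is an assumed heuristic:
    scale * n * EPS * ||a|| * ||b|| * ||w||.
    """
    n = len(b)                    # inner dimension of the product
    w = [1.0] * len(b[0])         # assumed checksum vector: all ones
    lhs = matvec(p, w)
    rhs = matvec(a, matvec(b, w))
    residual = max(abs(x - y) for x, y in zip(lhs, rhs))
    tol = scale * n * EPS * inf_norm(a) * inf_norm(b) * max(abs(x) for x in w)
    return residual <= tol


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
good = matmul(a, b)
bad = [row[:] for row in good]
bad[0][1] += 1e-3                 # simulate a fault in one output entry

print(checksum_ok(a, b, good))    # fault-free result
print(checksum_ok(a, b, bad))     # corrupted result
```

The point of the sketch is the tradeoff the abstract describes: a tolerance that is too tight flags ordinary roundoff as a fault, while one that is too loose lets small faults through. A single all-ones checksum vector also misses errors that cancel within a row, which is one reason to consider other checksum vectors.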

Original language: English (US)
Pages (from-to): 579-591
Number of pages: 13
Journal: IEEE Transactions on Computers
Volume: 52
Issue number: 5
State: Published - May 2003
Externally published: Yes

Keywords

  • Aerospace
  • Algorithm-based fault tolerance
  • Error analysis
  • Parallel numerical algorithms
  • Result checking

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture
  • Computational Theory and Mathematics
