TY - GEN
T1 - Algorithmic approaches to low overhead fault detection for sparse linear algebra
AU - Sloan, Joseph
AU - Kumar, Rakesh
AU - Bronevetsky, Greg
PY - 2012
Y1 - 2012
N2 - The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy mode. Previous techniques for Algorithm - Based Fault Tolerance (ABFT) [20] have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g. diagonal, banded diagonal, block diagonal), which allows for sampling techniques to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that pre-conditioning techniques can be used to make these applications more amenable to low-cost algorithmic checks. The proposed techniques are shown to yield up to 2× reductions in performance overhead over traditional ABFT checks for a spectrum of sparse problems. A case study using common linear solvers further illustrates the benefits of the proposed algorithmic techniques.
AB - The increasing size and complexity of High-Performance Computing systems is making it increasingly likely that individual circuits will produce erroneous results, especially when operated in a low energy mode. Previous techniques for Algorithm - Based Fault Tolerance (ABFT) [20] have been proposed for detecting errors in dense linear operations, but have high overhead in the context of sparse problems. In this paper, we propose a set of algorithmic techniques that minimize the overhead of fault detection for sparse problems. The techniques are based on two insights. First, many sparse problems are well structured (e.g. diagonal, banded diagonal, block diagonal), which allows for sampling techniques to produce good approximations of the checks used for fault detection. These approximate checks may be acceptable for many sparse linear algebra applications. Second, many linear applications have enough reuse that pre-conditioning techniques can be used to make these applications more amenable to low-cost algorithmic checks. The proposed techniques are shown to yield up to 2× reductions in performance overhead over traditional ABFT checks for a spectrum of sparse problems. A case study using common linear solvers further illustrates the benefits of the proposed algorithmic techniques.
KW - ABFT
KW - error detection
KW - numerical methods
KW - sparse linear algebra
UR - http://www.scopus.com/inward/record.url?scp=84866720696&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84866720696&partnerID=8YFLogxK
U2 - 10.1109/DSN.2012.6263938
DO - 10.1109/DSN.2012.6263938
M3 - Conference contribution
AN - SCOPUS:84866720696
SN - 9781467316248
T3 - Proceedings of the International Conference on Dependable Systems and Networks
BT - 2012 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012
T2 - 42nd Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN 2012
Y2 - 25 June 2012 through 28 June 2012
ER -