TY - GEN
T1 - Fault-tolerant high-performance matrix multiplication
T2 - Proceedings of the International Conference on Dependable Systems and Networks
AU - Gunnels, John A.
AU - Katz, Daniel S.
AU - Quintana-Ortí, Enrique S.
AU - Van de Geijn, Robert A.
N1 - Copyright: Copyright 2010 Elsevier B.V., All rights reserved.
PY - 2001
Y1 - 2001
N2 - In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
AB - In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
UR - http://www.scopus.com/inward/record.url?scp=0035789206&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0035789206&partnerID=8YFLogxK
U2 - 10.1109/DSN.2001.941390
DO - 10.1109/DSN.2001.941390
M3 - Conference contribution
AN - SCOPUS:0035789206
SN - 0769511015
SN - 9780769511016
T3 - Proceedings of the International Conference on Dependable Systems and Networks
SP - 47
EP - 56
BT - Proceedings of the International Conference on Dependable Systems and Networks
A2 - Young, D.C.
Y2 - 1 July 2001 through 4 July 2001
ER -