Fault-tolerant high-performance matrix multiplication: Theory and practice

John A. Gunnels, Daniel S. Katz, Enrique S. Quintana-Ortí, Robert A. Van de Geijn

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper, we extend the theory and practice regarding algorithmic fault-tolerant matrix-matrix multiplication, C = AB, in a number of ways. First, we propose low-overhead methods for detecting errors introduced not only in C but also in A and/or B. Second, we show that, theoretically, these methods will detect all errors as long as only one entry is corrupted. Third, we propose a low-overhead roll-back approach to correct errors once detected. Finally, we give a high-performance implementation of matrix-matrix multiplication that incorporates these error detection and correction methods. Empirical results demonstrate that these methods work well in practice while imposing an acceptable level of overhead relative to high-performance implementations without fault-tolerance.
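
The abstract does not spell out the detection method, so the following is only a minimal sketch of the classic checksum idea behind algorithmic fault tolerance, not the authors' scheme. It assumes an all-ones checksum vector w and tests whether C w equals A (B w). The helper names (matvec, matmul, checksum_residual, error_detected) and the tolerance tol are hypothetical, chosen for illustration.

```python
def matvec(M, x):
    """Multiply matrix M (a list of rows) by vector x."""
    return [sum(m * xi for m, xi in zip(row, x)) for row in M]


def matmul(A, B):
    """Plain product C = A B, standing in for a tuned kernel."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]


def checksum_residual(A, B, C):
    """Return d = C w - A (B w), with w the all-ones vector (an assumption)."""
    w = [1] * len(B[0])
    Cw = matvec(C, w)              # row sums of the computed result
    ABw = matvec(A, matvec(B, w))  # the same quantity from two matrix-vector products
    return [c - r for c, r in zip(Cw, ABw)]


def error_detected(A, B, C, tol=1e-9):
    """Flag an error if any entry of the residual exceeds tol in magnitude."""
    return any(abs(d) > tol for d in checksum_residual(A, B, C))


A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
C = matmul(A, B)
clean = error_detected(A, B, C)

# Corrupt a single entry of C to simulate a fault.
C_bad = [row[:] for row in C]
C_bad[1][0] += 5
corrupted = error_detected(A, B, C_bad)
residual = checksum_residual(A, B, C_bad)
```

In exact arithmetic, a single corrupted entry of C changes one row sum, so the residual is nonzero in exactly that row. The check costs only matrix-vector products, which is small next to the multiplication itself. With floating-point data the tolerance would have to account for rounding error.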

Original language: English (US)
Title of host publication: Proceedings of the International Conference on Dependable Systems and Networks
Editors: D.C. Young
Pages: 47-56
Number of pages: 10
State: Published - 2001
Externally published: Yes
Event: Proceedings of the International Conference on Dependable Systems and Networks - Goteborg, Sweden
Duration: Jul 1 2001 to Jul 4 2001

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications
