Understanding the propagation of error due to a silent data corruption in a sparse matrix vector multiply

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

With the rate of errors that silently affect an application's state or output expected to increase in future HPC machines, numerous mitigation schemes have been proposed, but little work has been done to investigate why these schemes detect some errors while others are masked. This paper investigates how silent data corruption (SDC) propagates through a sparse matrix-vector multiply (SpMV), a fundamental HPC computation kernel. We discover that analyzing the mathematics of the SpMV limits understanding of SDC propagation. We achieve a more complete understanding by investigating how SDC propagates in an SpMV as it is expressed in machine instructions.
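
As a rough illustration of the kind of propagation the abstract describes, the sketch below injects single bit flips into a small SpMV and reports which output rows change. It is not taken from the paper: the CSR layout, the 3x3 matrix, the chosen bit positions, and the helper names `flip_bit`, `spmv_csr`, and `corrupted_rows` are all assumptions made for this example.

```python
import struct


def flip_bit(value, bit):
    """Return the float64 `value` with one bit flipped (bit 0 = least significant)."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


def spmv_csr(row_ptr, col_idx, vals, x):
    """Compute y = A @ x for a matrix A stored in CSR form."""
    y = []
    for row in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[row], row_ptr[row + 1]):
            acc += vals[k] * x[col_idx[k]]
        y.append(acc)
    return y


def corrupted_rows(clean, faulty):
    """Indices of output entries that differ between a clean and a faulty run."""
    return [i for i, (a, b) in enumerate(zip(clean, faulty)) if a != b]


# Hypothetical 3x3 matrix: [[4, 0, 1], [0, 3, 0], [2, 0, 5]]
row_ptr = [0, 2, 3, 5]
col_idx = [0, 2, 1, 0, 2]
vals = [4.0, 1.0, 3.0, 2.0, 5.0]
x = [1.0, 2.0, 3.0]
y_clean = spmv_csr(row_ptr, col_idx, vals, x)

# Case 1: flip an exponent bit in one matrix value (the single entry of row 1).
vals_bad = list(vals)
vals_bad[2] = flip_bit(vals[2], 52)
rows_val = corrupted_rows(y_clean, spmv_csr(row_ptr, col_idx, vals_bad, x))

# Case 2: flip an exponent bit in x[0]; every row with a nonzero in column 0 reads it.
x_bad = list(x)
x_bad[0] = flip_bit(x[0], 52)
rows_x = corrupted_rows(y_clean, spmv_csr(row_ptr, col_idx, vals, x_bad))

# Case 3: flip a bit in a column index, so row 0 loads x[2] instead of x[0].
col_bad = list(col_idx)
col_bad[0] ^= 0b10
rows_idx = corrupted_rows(y_clean, spmv_csr(row_ptr, col_bad, vals, x))

# Case 4: the corruption from case 1 is masked when it is multiplied by zero.
x_zero = [1.0, 0.0, 3.0]
rows_masked = corrupted_rows(
    spmv_csr(row_ptr, col_idx, vals, x_zero),
    spmv_csr(row_ptr, col_idx, vals_bad, x_zero),
)

print(rows_val, rows_x, rows_idx, rows_masked)
```

In this toy setting, a corrupted matrix value or column index disturbs a single output row, a corrupted vector entry disturbs every row that touches that column, and a corruption multiplied by zero disappears entirely. The index case is one that the mathematical form y = Ax does not describe, since it only exists in the stored data structure and the loads that traverse it.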

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE International Conference on Cluster Computing, CLUSTER 2015
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 541-542
Number of pages: 2
ISBN (Electronic): 9781467365987
State: Published - Oct 26 2015
Event: IEEE International Conference on Cluster Computing, CLUSTER 2015 - Chicago, United States
Duration: Sep 8 2015 → Sep 11 2015

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2015-October
ISSN (Print): 1552-5244

Other

Other: IEEE International Conference on Cluster Computing, CLUSTER 2015
Country/Territory: United States
City: Chicago
Period: 9/8/15 → 9/11/15

Keywords

  • Error Propagation
  • Silent Data Corruption

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
