Abstract

The effectiveness of sparse linear solvers is typically studied in terms of their convergence properties and computational complexity, while their ability to handle transient hardware errors, such as bit-flips that lead to silent data corruption (SDC), has received less attention. As supercomputers continue to add more cores to increase performance, they are also becoming more susceptible to SDC. Consequently, understanding the impact of SDC on algorithms and common applications is an important component of solver analysis. In this paper, we investigate algebraic multigrid (AMG) in an environment exposed to corruption through bit-flips. We propose an algorithm-based detection and recovery scheme that preserves the numerical properties of AMG while maintaining high convergence rates in this environment. We also present a performance model and numerical results in support of the methodology.
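To make the notion of a bit-flip concrete, the following minimal sketch flips a single bit in the IEEE 754 double-precision representation of a number. It is an illustration only: the helper name `flip_bit` and the chosen bit positions are assumptions made here, and the sketch does not reproduce the fault model or the detection and recovery scheme of the paper.

```python
import struct


def flip_bit(value: float, bit: int) -> float:
    """Return `value` with one bit of its 64-bit IEEE 754 encoding flipped.

    Hypothetical helper for illustration. Bit 0 is the least significant
    mantissa bit, bits 52-62 hold the exponent, and bit 63 is the sign.
    """
    if not 0 <= bit < 64:
        raise ValueError("bit must be in the range [0, 63]")
    # Reinterpret the double as an unsigned 64-bit integer.
    (as_int,) = struct.unpack("<Q", struct.pack("<d", value))
    # Toggle the requested bit and reinterpret the result as a double.
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped


x = 1.0
low = flip_bit(x, 0)    # lowest mantissa bit: a tiny relative perturbation
high = flip_bit(x, 62)  # highest exponent bit: 1.0 becomes infinity
sign = flip_bit(x, 63)  # sign bit: 1.0 becomes -1.0

print(low, high, sign)
```

A flip in a low-order mantissa bit changes the value only slightly and can pass unnoticed, which is what makes such corruption silent, whereas a flip in a high-order exponent bit can change the magnitude drastically.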

Original language: English (US)
Pages (from-to): 1-8
Number of pages: 8
Journal: Simulation Series
Volume: 47
Issue number: 4
State: Published - Jan 1 2015
Event: 23rd High Performance Computing Symposium, HPC 2015, Part of the 2015 Spring Simulation Multi-Conference, SpringSim 2015 - Alexandria, United States
Duration: Apr 12 2015 to Apr 15 2015

Keywords

  • Algebraic multigrid
  • Fault tolerance
  • Resilience
  • Silent data corruption

ASJC Scopus subject areas

  • Computer Networks and Communications

Title: Towards a more fault resilient multigrid solver
