Abstract

The effectiveness of sparse linear solvers is typically studied in terms of their convergence properties and computational complexity, while their ability to handle transient hardware errors, such as bit-flips that lead to silent data corruption (SDC), has received less attention. As supercomputers continue to add more cores to increase performance, they are also becoming more susceptible to SDC. Consequently, understanding the impact of SDC on algorithms and common applications is an important component of solver analysis. In this paper, we investigate algebraic multigrid (AMG) in an environment exposed to corruptions through bit-flips. We propose an algorithm-based detection and recovery scheme that preserves the numerical properties of AMG while maintaining high convergence rates in this environment. We also introduce a performance model and numerical results in support of the methodology.

Original language: English (US)
Pages (from-to): 1-8
Number of pages: 8
Journal: Simulation Series
Volume: 47
Issue number: 4
State: Published - Jan 1 2015
Event: 23rd High Performance Computing Symposium, HPC 2015, Part of the 2015 Spring Simulation Multi-Conference, SpringSim 2015 - Alexandria, United States
Duration: Apr 12 2015 to Apr 15 2015

Fingerprint

Supercomputers
Computational complexity
Hardware
Recovery

Keywords

  • Algebraic multigrid
  • Fault tolerance
  • Resilience
  • Silent data corruption

ASJC Scopus subject areas

  • Computer Networks and Communications

Cite this

Towards a more fault resilient multigrid solver. / Calhoun, Jon; Olson, Luke; Snir, Marc; Gropp, William D.

In: Simulation Series, Vol. 47, No. 4, 01.01.2015, p. 1-8.

Research output: Contribution to journal, Conference article

@article{ad803f44e4294aa68d8a56e5be7f37fd,
title = "Towards a more fault resilient multigrid solver",
abstract = "The effectiveness of sparse, linear solvers is typically studied in terms of their convergence properties and computational complexity, while their ability to handle transient hardware errors, such as bit-flips that lead to silent data corruption (SDC), has received less attention. As supercomputers continue to add more cores to increase performance, they are also becoming more susceptible to SDC. Consequently, understanding the impact of SDC on algorithms and common applications is an important component of solver analysis. In this paper, we investigate algebraic multigrid (AMG) in an environment exposed to corruptions through bit-flips. We propose an algorithmic based detection and recovery scheme that maintains the numerical properties of AMG, while maintaining high convergence rates in this environment. We also introduce a performance model and numerical results in support of the methodology.",
keywords = "Algebraic multigrid, Fault tolerance, Resilience, Silent data corruption",
author = "Jon Calhoun and Luke Olson and Marc Snir and Gropp, {William D}",
year = "2015",
month = "1",
day = "1",
language = "English (US)",
volume = "47",
pages = "1--8",
journal = "Simulation Series",
issn = "0735-9276",
number = "4",

}

TY - JOUR

T1 - Towards a more fault resilient multigrid solver

AU - Calhoun, Jon

AU - Olson, Luke

AU - Snir, Marc

AU - Gropp, William D

PY - 2015/1/1

Y1 - 2015/1/1

N2 - The effectiveness of sparse, linear solvers is typically studied in terms of their convergence properties and computational complexity, while their ability to handle transient hardware errors, such as bit-flips that lead to silent data corruption (SDC), has received less attention. As supercomputers continue to add more cores to increase performance, they are also becoming more susceptible to SDC. Consequently, understanding the impact of SDC on algorithms and common applications is an important component of solver analysis. In this paper, we investigate algebraic multigrid (AMG) in an environment exposed to corruptions through bit-flips. We propose an algorithmic based detection and recovery scheme that maintains the numerical properties of AMG, while maintaining high convergence rates in this environment. We also introduce a performance model and numerical results in support of the methodology.

KW - Algebraic multigrid

KW - Fault tolerance

KW - Resilience

KW - Silent data corruption

UR - http://www.scopus.com/inward/record.url?scp=84937393830&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84937393830&partnerID=8YFLogxK

M3 - Conference article

AN - SCOPUS:84937393830

VL - 47

SP - 1

EP - 8

JO - Simulation Series

JF - Simulation Series

SN - 0735-9276

IS - 4

ER -