Abstract
The effectiveness of sparse linear solvers is typically studied in terms of their convergence properties and computational complexity, while their ability to handle transient hardware errors, such as bit-flips that lead to silent data corruption (SDC), has received less attention. As supercomputers continue to add more cores to increase performance, they are also becoming more susceptible to SDC. Consequently, understanding the impact of SDC on algorithms and common applications is an important component of solver analysis. In this paper, we investigate algebraic multigrid (AMG) in an environment exposed to corruptions through bit-flips. We propose an algorithm-based detection and recovery scheme that preserves the numerical properties of AMG while maintaining high convergence rates in this environment. We also introduce a performance model and numerical results in support of the methodology.
Original language | English (US) |
---|---|
Pages (from-to) | 1-8 |
Number of pages | 8 |
Journal | Simulation Series |
Volume | 47 |
Issue number | 4 |
State | Published - 2015 |
Event | 23rd High Performance Computing Symposium, HPC 2015, Part of the 2015 Spring Simulation Multi-Conference, SpringSim 2015 - Alexandria, United States; Duration: Apr 12 2015 → Apr 15 2015 |
Keywords
- Algebraic multigrid
- Fault tolerance
- Resilience
- Silent data corruption
ASJC Scopus subject areas
- Computer Networks and Communications