Quantifying the Impact of Memory Errors in Deep Learning

Zhao Zhang, Lei Huang, Ruizhu Huang, Weijia Xu, Daniel S. Katz

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The use of deep learning (DL) on HPC resources has become common as scientists explore and exploit DL methods to solve domain problems. On the other hand, in the coming exascale computing era, a high error rate is expected to be problematic for most HPC applications. The impact of errors on DL applications, especially DL training, remains unclear given their stochastic nature. In this paper, we focus on understanding DL training applications on HPC in the presence of silent data corruption. Specifically, we design and perform a quantification study with three representative applications by manually injecting silent data corruption errors (SDCs) across the design space and compare training results with the error-free baseline. The results show only 0.61-1.76% of SDCs cause training failures, and taking the SDC rate in modern hardware into account, the actual chance of a failure is one in thousands to millions of executions. With this quantitatively measured impact, computing centers can make rational design decisions based on their application portfolio, the acceptable failure rate, and financial constraints; for example, they might determine their confidence in the correctness of training results performed on processors without error correction code (ECC) RAM. We also discover that over 75-90% of the SDCs that cause catastrophic errors can be easily detected by a training loss in the next iteration. Thus we propose this error-aware software solution to correct catastrophic errors, as it has significantly lower time and space overhead compared to algorithm-based fault-tolerance (ABFT) and ECC.
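The two mechanisms the abstract mentions, injecting a silent data corruption and noticing it from the training loss in the next iteration, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how bits were flipped or what detection rule was used. The function names `flip_bit` and `loss_is_suspicious`, the float32 representation, and the threshold `factor` are all assumptions made for illustration.

```python
import math
import struct


def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of `value` viewed as an IEEE 754 float32.

    bit 0 is the least significant mantissa bit, bit 31 is the sign bit.
    This simulates a single-bit silent data corruption (an assumed fault model).
    """
    (as_int,) = struct.unpack("<I", struct.pack("<f", value))
    (flipped,) = struct.unpack("<f", struct.pack("<I", as_int ^ (1 << bit)))
    return flipped


def loss_is_suspicious(prev_loss: float, loss: float, factor: float = 10.0) -> bool:
    """Hypothetical detection rule: flag a non-finite loss or a sudden jump.

    `factor` is an illustrative threshold, not a value taken from the paper.
    """
    return not math.isfinite(loss) or loss > factor * prev_loss


# A flip in a low mantissa bit barely changes the value, while a flip in a
# high exponent bit can turn a small weight into infinity.
benign = flip_bit(1.0, 0)
catastrophic = flip_bit(1.0, 30)
```

In a training loop, a check like `loss_is_suspicious` could trigger a rollback to the parameters of the previous iteration. How much state must be kept for that is a design choice the abstract leaves open.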

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781728147345
State: Published - Sep 2019
Event: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019 - Albuquerque, United States
Duration: Sep 23 2019 → Sep 26 2019

Publication series

Name: Proceedings - IEEE International Conference on Cluster Computing, ICCC
Volume: 2019-September
ISSN (Print): 1552-5244

Conference

Conference: 2019 IEEE International Conference on Cluster Computing, CLUSTER 2019
Country/Territory: United States
City: Albuquerque
Period: 9/23/19 → 9/26/19

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Signal Processing
