Optimizing Selective Protection for CNN Resilience

Abdulrahman Mahmoud, Siva Kumar Sastry Hari, Christopher Wardlaw Fletcher, Sarita V. Adve, Charbel Sakr, Naresh Shanbhag, Pavlo Molchanov, Michael B. Sullivan, Timothy Tsai, Stephen W. Keckler

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

As CNNs are being extensively employed in high-performance and safety-critical applications that demand high reliability, it is important to ensure that they are resilient to transient hardware errors. Traditional full redundancy solutions provide high error coverage, but the associated overheads are often prohibitively high for resource-constrained systems. In this work, we propose software-directed selective protection techniques to target the most vulnerable work in a CNN, providing a low-cost solution. We propose and evaluate two domain-specific selective protection techniques for CNNs that target different granularities. First, we develop a feature-map level resilience technique (FLR), which identifies and statically protects the most vulnerable feature maps in a CNN. Second, we develop an inference level resilience technique (ILR), which selectively reruns vulnerable inferences by analyzing their output. Third, we show that the combination of both techniques (FILR) is highly efficient, achieving nearly full error coverage (99.78% on average) for quantized inferences via selective protection. Our tunable approach enables developers to evaluate CNN resilience to hardware errors before deployment using MAC operations as overhead for quicker trade-off analysis. For example, targeting 100% error coverage on ResNet50 with FILR requires 20.8% additional MACs, while measurements on a Jetson Xavier GPU show 4.6% runtime overhead.
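The abstract does not say how ILR analyzes an inference's output, so the following is only a minimal sketch of the general idea of selectively rerunning inferences, not the authors' method. It assumes, purely for illustration, that an inference counts as vulnerable when the gap between its two highest softmax probabilities falls below a threshold. The names `infer_with_selective_rerun`, `top2_margin`, `extra_mac_fraction`, and the `margin_threshold` default are all hypothetical.

```python
import math
from typing import Callable, List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def top2_margin(logits: Sequence[float]) -> float:
    """Gap between the two largest class probabilities (needs >= 2 classes)."""
    probs = sorted(softmax(logits), reverse=True)
    return probs[0] - probs[1]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value."""
    return max(range(len(values)), key=lambda i: values[i])


def infer_with_selective_rerun(
    infer: Callable[[object], Sequence[float]],
    x: object,
    margin_threshold: float = 0.2,  # assumed value, not from the paper
) -> Tuple[int, bool]:
    """Run one inference and rerun it only if the output looks vulnerable.

    Returns (predicted class, whether a rerun happened).
    """
    logits = infer(x)
    if top2_margin(logits) >= margin_threshold:
        return argmax(logits), False
    # Assumed policy: a transient error is unlikely to strike twice,
    # so the result of the second run is taken as final.
    return argmax(infer(x)), True


def extra_mac_fraction(rerun_flags: Sequence[bool]) -> float:
    """Additional MACs as a fraction of baseline, assuming every rerun
    costs exactly one full inference."""
    return sum(rerun_flags) / len(rerun_flags)
```

Under these assumptions the overhead is tunable through the threshold alone: a higher `margin_threshold` reruns more inferences and raises the extra MAC fraction, which mirrors the coverage-versus-overhead trade-off the abstract describes.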

Original language: English (US)
Title of host publication: Proceedings - 2021 IEEE 32nd International Symposium on Software Reliability Engineering, ISSRE 2021
Editors: Zhi Jin, Xuandong Li, Jianwen Xiang, Leonardo Mariani, Ting Liu, Xiao Yu, Nahgmeh Ivaki
Publisher: IEEE Computer Society
Pages: 127-138
Number of pages: 12
ISBN (Electronic): 9781665425872
State: Published - 2021
Event: 32nd IEEE International Symposium on Software Reliability Engineering, ISSRE 2021 - Wuhan, China
Duration: Oct 25 2021 to Oct 28 2021

Publication series

Name: Proceedings - International Symposium on Software Reliability Engineering, ISSRE
Volume: 2021-October
ISSN (Print): 1071-9458

Conference

Conference: 32nd IEEE International Symposium on Software Reliability Engineering, ISSRE 2021
Country/Territory: China
City: Wuhan
Period: 10/25/21 to 10/28/21

Keywords

  • Convolutional Neural Networks (CNNs)
  • Errors
  • Reliability
  • Silent Data Corruptions (SDC)
  • Software directed
  • Vulnerability

ASJC Scopus subject areas

  • Software
  • Safety, Risk, Reliability and Quality
