TY - GEN
T1 - Optimizing Selective Protection for CNN Resilience
AU - Mahmoud, Abdulrahman
AU - Sastry Hari, Siva Kumar
AU - Fletcher, Christopher W.
AU - Adve, Sarita V.
AU - Sakr, Charbel
AU - Shanbhag, Naresh
AU - Molchanov, Pavlo
AU - Sullivan, Michael B.
AU - Tsai, Timothy
AU - Keckler, Stephen W.
N1 - Publisher Copyright: © 2021 IEEE.
PY - 2021
Y1 - 2021
N2 - As CNNs are being extensively employed in high-performance and safety-critical applications that demand high reliability, it is important to ensure that they are resilient to transient hardware errors. Traditional full redundancy solutions provide high error coverage, but the associated overheads are often prohibitively high for resource-constrained systems. In this work, we propose software-directed selective protection techniques to target the most vulnerable work in a CNN, providing a low-cost solution. We propose and evaluate two domain-specific selective protection techniques for CNNs that target different granularities. First, we develop a feature-map level resilience technique (FLR), which identifies and statically protects the most vulnerable feature maps in a CNN. Second, we develop an inference level resilience technique (ILR), which selectively reruns vulnerable inferences by analyzing their output. Third, we show that the combination of both techniques (FILR) is highly efficient, achieving nearly full error coverage (99.78% on average) for quantized inferences via selective protection. Our tunable approach enables developers to evaluate CNN resilience to hardware errors before deployment using MAC operations as overhead for quicker trade-off analysis. For example, targeting 100% error coverage on ResNet50 with FILR requires 20.8% additional MACs, while measurements on a Jetson Xavier GPU show 4.6% runtime overhead.
KW - Convolutional Neural Networks (CNNs)
KW - Errors
KW - Reliability
KW - Silent Data Corruptions (SDC)
KW - Software directed
KW - Vulnerability
UR - http://www.scopus.com/inward/record.url?scp=85126396528&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85126396528&partnerID=8YFLogxK
U2 - 10.1109/ISSRE52982.2021.00025
DO - 10.1109/ISSRE52982.2021.00025
M3 - Conference contribution
AN - SCOPUS:85126396528
T3 - Proceedings - International Symposium on Software Reliability Engineering, ISSRE
SP - 127
EP - 138
BT - Proceedings - 2021 IEEE 32nd International Symposium on Software Reliability Engineering, ISSRE 2021
A2 - Jin, Zhi
A2 - Li, Xuandong
A2 - Xiang, Jianwen
A2 - Mariani, Leonardo
A2 - Liu, Ting
A2 - Yu, Xiao
A2 - Ivaki, Naghmeh
PB - IEEE Computer Society
T2 - 32nd IEEE International Symposium on Software Reliability Engineering, ISSRE 2021
Y2 - 25 October 2021 through 28 October 2021
ER -