Accumulation bit-width scaling for ultra-low precision training of deep networks

Charbel Sakr, Naigang Wang, Chia Yu Chen, Jungwook Choi, Ankur Agrawal, Naresh R Shanbhag, Kailash Gopalakrishnan

Research output: Contribution to conference › Paper

Abstract

Efforts to reduce the numerical precision of computations in deep learning training have yielded systems that aggressively quantize weights and activations, yet employ wide high-precision accumulators for partial sums in inner-product operations to preserve the quality of convergence. The absence of any framework to analyze the precision requirements of partial sum accumulations results in conservative design choices. This imposes an upper-bound on the reduction of complexity of multiply-accumulate units. We present a statistical approach to analyze the impact of reduced accumulation precision on deep learning training. Observing that a bad choice for accumulation precision results in loss of information that manifests itself as a reduction in variance in an ensemble of partial sums, we derive a set of equations that relate this variance to the length of accumulation and the minimum number of bits needed for accumulation. We apply our analysis to three benchmark networks: CIFAR-10 ResNet 32, ImageNet ResNet 18 and ImageNet AlexNet. In each case, with accumulation precision set in accordance with our proposed equations, the networks successfully converge to the single precision floating-point baseline. We also show that reducing accumulation precision further degrades the quality of the trained network, proving that our equations produce tight bounds. Overall this analysis enables precise tailoring of computation hardware to the application, yielding area- and power-optimal systems.
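As a rough illustration of the variance-loss effect described in the abstract (this is not the authors' code, nor the paper's derived equations), the Python sketch below accumulates the product terms of a random dot product in a hypothetical narrow floating-point accumulator and compares the ensemble variance of the resulting sums against a float64 reference. The helper names (chop_mantissa, accumulate), the 8-bit mantissa width, and the accumulation length are illustrative assumptions chosen only to make swamping visible.

import numpy as np

def chop_mantissa(x, m):
    # Keep roughly m significant mantissa bits: split x into a mantissa in
    # [0.5, 1) and an exponent, round the mantissa to m bits, and recombine.
    mant, exp = np.frexp(x)
    return np.ldexp(np.round(mant * 2.0**m) / 2.0**m, exp)

def accumulate(terms, m=None):
    # Sum the terms sequentially; if m is given, round the running partial
    # sum to m mantissa bits after every addition (models a narrow accumulator).
    s = 0.0
    for t in terms:
        s = s + t
        if m is not None:
            s = chop_mantissa(s, m)
    return s

rng = np.random.default_rng(0)
n, trials = 4096, 500   # accumulation length and ensemble size (illustrative)
full, low = [], []
for _ in range(trials):
    # Product terms of a dot product between two random vectors.
    terms = rng.standard_normal(n) * rng.standard_normal(n)
    full.append(accumulate(terms))        # float64 reference accumulator
    low.append(accumulate(terms, m=8))    # hypothetical 8-bit-mantissa accumulator
print("variance ratio (reduced / full precision):", np.var(low) / np.var(full))

In this toy setting the printed ratio typically falls below 1: once the running sum grows, terms smaller than about half a unit in the last place of the narrow accumulator are absorbed with no effect. Widening the mantissa (larger m) or shortening the accumulation (smaller n) pushes the ratio back toward 1, which is the qualitative trade-off the paper's accumulation bit-width equations quantify.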

Original language: English (US)
State: Published - Jan 1, 2019
Event: 7th International Conference on Learning Representations, ICLR 2019 - New Orleans, United States
Duration: May 6, 2019 - May 9, 2019

Conference

Conference: 7th International Conference on Learning Representations, ICLR 2019
Country: United States
City: New Orleans
Period: 5/6/19 - 5/9/19

ASJC Scopus subject areas

  • Education
  • Computer Science Applications
  • Linguistics and Language
  • Language and Linguistics

Cite this

Sakr, C., Wang, N., Chen, C. Y., Choi, J., Agrawal, A., Shanbhag, N. R., & Gopalakrishnan, K. (2019). Accumulation bit-width scaling for ultra-low precision training of deep networks. Paper presented at 7th International Conference on Learning Representations, ICLR 2019, New Orleans, United States.

Scopus record: http://www.scopus.com/inward/record.url?scp=85071160073&partnerID=8YFLogxK (AN: SCOPUS:85071160073)