TY - GEN
T1 - SparseTrain
T2 - 2020 ACM International Conference on Parallel Architectures and Compilation Techniques, PACT 2020
AU - Gong, Zhangxiaowen
AU - Ji, Houxiang
AU - Fletcher, Christopher W.
AU - Hughes, Christopher J.
AU - Torrellas, Josep
N1 - Publisher Copyright:
© 2020 Association for Computing Machinery.
PY - 2020/9/30
Y1 - 2020/9/30
N2 - Our community has improved the efficiency of deep learning applications by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. In this paper, we propose SparseTrain, a software-only scheme to leverage dynamic sparsity during training on general-purpose SIMD processors. SparseTrain exploits zeros introduced by the ReLU activation function to both feature maps and their gradients. Exploiting such sparsity is challenging because the sparsity degree is moderate and the locations of zeros change over time. SparseTrain identifies zeros in a dense data representation and performs vectorized computation. Variations of the scheme are applicable to all major components of training: forward propagation, backward propagation by inputs, and backward propagation by weights. Our experiments on a 6-core Intel Skylake-X server show that SparseTrain is very effective. In end-to-end training of VGG16, ResNet-34, and ResNet-50 with ImageNet, SparseTrain outperforms a highly-optimized direct convolution on the non-initial convolutional layers by 2.19x, 1.37x, and 1.31x, respectively. SparseTrain also benefits inference. It accelerates the non-initial convolutional layers of the aforementioned models by 1.88x, 1.64x, and 1.44x, respectively.
AB - Our community has improved the efficiency of deep learning applications by exploiting sparsity in inputs. Most of that work, though, is for inference, where weight sparsity is known statically, and/or for specialized hardware. In this paper, we propose SparseTrain, a software-only scheme to leverage dynamic sparsity during training on general-purpose SIMD processors. SparseTrain exploits zeros introduced by the ReLU activation function to both feature maps and their gradients. Exploiting such sparsity is challenging because the sparsity degree is moderate and the locations of zeros change over time. SparseTrain identifies zeros in a dense data representation and performs vectorized computation. Variations of the scheme are applicable to all major components of training: forward propagation, backward propagation by inputs, and backward propagation by weights. Our experiments on a 6-core Intel Skylake-X server show that SparseTrain is very effective. In end-to-end training of VGG16, ResNet-34, and ResNet-50 with ImageNet, SparseTrain outperforms a highly-optimized direct convolution on the non-initial convolutional layers by 2.19x, 1.37x, and 1.31x, respectively. SparseTrain also benefits inference. It accelerates the non-initial convolutional layers of the aforementioned models by 1.88x, 1.64x, and 1.44x, respectively.
KW - CPU
KW - Convolution
KW - Deep neural networks
KW - Sparsity
KW - Training
UR - http://www.scopus.com/inward/record.url?scp=85094185472&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85094185472&partnerID=8YFLogxK
U2 - 10.1145/3410463.3414655
DO - 10.1145/3410463.3414655
M3 - Conference contribution
AN - SCOPUS:85094185472
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 279
EP - 292
BT - PACT 2020 - Proceedings of the ACM International Conference on Parallel Architectures and Compilation Techniques
PB - Association for Computing Machinery
Y2 - 3 October 2020 through 7 October 2020
ER -
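
The abstract describes identifying zeros in a dense data representation and skipping the vectorized work they would feed. The following is a minimal, hedged sketch of that general idea only, not the paper's actual kernel: a toy 1-D direct convolution in which each input activation is broadcast against a small vector of output channels, and the whole block of multiply-accumulates is skipped when the activation is zero (e.g. after ReLU). All names, shapes, and the loop order here are illustrative assumptions.

```cpp
// Hedged sketch of zero-skipping in a dense, vectorizable convolution loop.
// Not the SparseTrain implementation; shapes and layout are assumptions.
#include <cstdio>
#include <vector>

int main() {
    const int IN  = 8;          // input length
    const int K   = 3;          // kernel width
    const int OC  = 4;          // output channels
    const int OUT = IN - K + 1; // valid output length

    // Post-ReLU input: the zeros are the dynamic sparsity to exploit.
    std::vector<float> input = {0.f, 1.5f, 0.f, 0.f, 2.f, 0.f, 3.f, 0.f};
    // Weights laid out as [K][OC] so one input scalar multiplies a
    // contiguous vector of OC weights (a natural SIMD unit).
    std::vector<float> weight(K * OC, 0.25f);
    std::vector<float> output(OUT * OC, 0.f);

    for (int i = 0; i < IN; ++i) {
        float x = input[i];
        if (x == 0.f) continue;           // skip every FMA fed by this zero activation
        for (int k = 0; k < K; ++k) {
            int o = i - k;                // output position this tap contributes to
            if (o < 0 || o >= OUT) continue;
            for (int c = 0; c < OC; ++c)  // this inner loop maps to one vector FMA
                output[o * OC + c] += x * weight[k * OC + c];
        }
    }

    for (int o = 0; o < OUT; ++o)
        printf("out[%d][0] = %.2f\n", o, output[o * OC]);
    return 0;
}
```

The design point the sketch illustrates is that the zero test is amortized over a whole vector of multiply-accumulates, which is why even the moderate sparsity mentioned in the abstract can pay off on general-purpose SIMD hardware.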