TY - GEN
T1 - Channel and filter parallelism for large-scale CNN training
AU - Dryden, Nikoli
AU - Maruyama, Naoya
AU - Moon, Tim
AU - Benson, Tom
AU - Snir, Marc
AU - Van Essen, Brian
N1 - Publisher Copyright:
© 2019 ACM.
PY - 2019/11/17
Y1 - 2019/11/17
N2 - Accelerating large-scale CNN training is needed to keep training times reasonable as datasets grow larger and models become more complex. Existing frameworks primarily scale using data-parallelism, but this is limited by the mini-batch size, which cannot grow arbitrarily. We introduce three algorithms that partition channel or filter data to exploit parallelism beyond the sample dimension. Further, they partition the parameters of convolutional layers, replacing global allreduces with segmented allreduces: smaller, concurrent allreduces among disjoint processor sets. These algorithms enable strong scaling, reduced communication overhead, and reduced memory pressure, enabling training of very wide CNNs. We demonstrate improved strong and weak scaling, including up to 4.1x reductions in training time for residual networks and 4x reductions in allreduce overhead. We also show that wider models provide improved accuracy on ImageNet. We study the current limitations of our algorithms and provide a direction for future optimizations of large-scale deep learning frameworks.
AB - Accelerating large-scale CNN training is needed to keep training times reasonable as datasets grow larger and models become more complex. Existing frameworks primarily scale using data-parallelism, but this is limited by the mini-batch size, which cannot grow arbitrarily. We introduce three algorithms that partition channel or filter data to exploit parallelism beyond the sample dimension. Further, they partition the parameters of convolutional layers, replacing global allreduces with segmented allreduces: smaller, concurrent allreduces among disjoint processor sets. These algorithms enable strong scaling, reduced communication overhead, and reduced memory pressure, enabling training of very wide CNNs. We demonstrate improved strong and weak scaling, including up to 4.1x reductions in training time for residual networks and 4x reductions in allreduce overhead. We also show that wider models provide improved accuracy on ImageNet. We study the current limitations of our algorithms and provide a direction for future optimizations of large-scale deep learning frameworks.
KW - Algorithms
KW - CNN
KW - Convolution
KW - Deep learning
KW - Scaling
UR - http://www.scopus.com/inward/record.url?scp=85076153623&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85076153623&partnerID=8YFLogxK
U2 - 10.1145/3295500.3356207
DO - 10.1145/3295500.3356207
M3 - Conference contribution
AN - SCOPUS:85076153623
T3 - International Conference for High Performance Computing, Networking, Storage and Analysis, SC
BT - Proceedings of SC 2019
PB - IEEE Computer Society
T2 - 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
Y2 - 17 November 2019 through 22 November 2019
ER -
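
Editor's note (not part of the record): the abstract above describes replacing one global allreduce over all convolutional-layer gradients with segmented allreduces among disjoint processor sets. The following minimal mpi4py sketch illustrates only that general idea; the group size, partitioning of the gradient buffer, and averaging step are illustrative assumptions, not the paper's actual scheme.

# Minimal sketch of a segmented allreduce: each disjoint processor group
# allreduces only its own partition of the gradients, instead of all ranks
# participating in one global allreduce. GROUP_SIZE and buffer sizes are
# hypothetical values chosen for illustration.
import numpy as np
from mpi4py import MPI

world = MPI.COMM_WORLD
rank = world.Get_rank()

GROUP_SIZE = 4                      # hypothetical ranks per parameter partition
color = rank // GROUP_SIZE          # ranks with the same color form one group
subcomm = world.Split(color=color, key=rank)

# Each group owns a disjoint slice of the layer's gradients (illustrative size).
local_grads = np.random.rand(1 << 20).astype(np.float32)
reduced = np.empty_like(local_grads)

# Smaller, concurrent allreduces: every group reduces independently of the others.
subcomm.Allreduce(local_grads, reduced, op=MPI.SUM)
reduced /= subcomm.Get_size()       # average the gradients within the group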