Channel and filter parallelism for large-scale CNN training

Nikoli Dryden, Naoya Maruyama, Tim Moon, Tom Benson, Marc Snir, Brian Van Essen

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Accelerating large-scale CNN training is needed to keep training times reasonable as datasets grow larger and models become more complex. Existing frameworks primarily scale using data parallelism, but this is limited by the mini-batch size, which cannot grow arbitrarily. We introduce three algorithms that partition channel or filter data to exploit parallelism beyond the sample dimension. Further, they partition the parameters of convolutional layers, replacing global allreduces with segmented allreduces: smaller, concurrent allreduces among disjoint processor sets. These algorithms enable strong scaling, reduced communication overhead, and reduced memory pressure, enabling training of very wide CNNs. We demonstrate improved strong and weak scaling, including up to 4.1x reductions in training time for residual networks and 4x reductions in allreduce overhead. We also show that wider models provide improved accuracy on ImageNet. We study the current limitations of our algorithms and provide a direction for future optimizations of large-scale deep learning frameworks.
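
To make the abstract's central idea concrete, the sketch below illustrates a segmented allreduce: instead of one global allreduce over every rank, each disjoint processor set reduces only its own partition of a layer's parameters. This is a minimal mpi4py illustration, not the paper's implementation; the group size and gradient shape are hypothetical placeholders.

    # Illustrative sketch of a segmented allreduce (not the paper's code).
    # Ranks are split into disjoint sets; each set allreduces only its own
    # partition of the convolutional layer's gradients, so the reductions
    # are smaller and can proceed concurrently.
    from mpi4py import MPI
    import numpy as np

    world = MPI.COMM_WORLD
    rank = world.Get_rank()

    GROUP_SIZE = 4  # hypothetical: ranks sharing one filter/channel partition
    color = rank // GROUP_SIZE
    segment = world.Split(color=color, key=rank)  # disjoint processor set

    # Placeholder gradient for this rank's parameter partition.
    local_grad = np.random.rand(1024).astype(np.float32)

    # Data-parallel baseline: a single global allreduce over all ranks.
    global_sum = np.empty_like(local_grad)
    world.Allreduce(local_grad, global_sum, op=MPI.SUM)

    # Segmented allreduce: each disjoint set reduces only its own partition.
    segment_sum = np.empty_like(local_grad)
    segment.Allreduce(local_grad, segment_sum, op=MPI.SUM)

    segment.Free()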

Original language: English (US)
Title of host publication: Proceedings of SC 2019
Subtitle of host publication: The International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9781450362290
DOIs: https://doi.org/10.1145/3295500.3356207
State: Published - Nov 17 2019
Event: 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019 - Denver, United States
Duration: Nov 17 2019 - Nov 22 2019

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2019 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2019
Country: United States
City: Denver
Period: 11/17/19 - 11/22/19

Keywords

  • Algorithms
  • CNN
  • Convolution
  • Deep learning
  • Scaling

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
  • Software

Cite this

Dryden, N., Maruyama, N., Moon, T., Benson, T., Snir, M., & Van Essen, B. (2019). Channel and filter parallelism for large-scale CNN training. In Proceedings of SC 2019: The International Conference for High Performance Computing, Networking, Storage and Analysis [a10] (International Conference for High Performance Computing, Networking, Storage and Analysis, SC). IEEE Computer Society. https://doi.org/10.1145/3295500.3356207
