Improving strong-scaling of CNN training by exploiting finer-grained parallelism

Nikoli Dryden, Naoya Maruyama, Tom Benson, Tim Moon, Marc Snir, Brian Van Essen

Research output: Conference contribution (Chapter in Book/Report/Conference proceeding)

Abstract

Scaling CNN training is necessary to keep up with growing datasets and reduce training time. We also see an emerging need to handle datasets with very large samples, where memory requirements for training are large. Existing training frameworks use a data-parallel approach that partitions samples within a mini-batch, but limits on scaling the mini-batch size and on memory consumption make this untenable for large samples. We describe and implement new approaches to convolution, which parallelize using spatial decomposition or a combination of sample and spatial decomposition. This introduces many performance knobs for a network, so we develop a performance model for CNNs and present a method for using it to automatically determine efficient parallelization strategies. We evaluate our algorithms with microbenchmarks and image classification with ResNet-50. Our algorithms allow us to prototype a model for a mesh-tangling dataset, where sample sizes are very large. We show that our parallelization achieves excellent strong and weak scaling and enables training for previously unreachable datasets.
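
The spatial decomposition described in the abstract partitions each sample's spatial domain across processes, so the convolution stencil at a partition boundary must see a few rows of its neighbor's data (a halo exchange). The following is a minimal single-process sketch of that idea in NumPy/SciPy, not the authors' distributed GPU implementation: it splits one 2D sample along the height axis, emulates the halo exchange by slicing extra boundary rows from the full array, convolves each strip locally, and checks that the stitched result matches the full convolution. All names (spatial_decomposed_conv2d, conv2d_same, the strip bounds) are illustrative assumptions.

# Minimal sketch of spatially decomposed convolution (illustrative only; the
# paper's implementation is distributed and GPU-based). Assumes a stride-1,
# zero-padded ("same") 2D convolution on a single sample.
import numpy as np
from scipy.signal import convolve2d


def conv2d_same(x, k):
    """Reference zero-padded 'same' convolution on one 2D sample."""
    return convolve2d(x, k, mode="same", boundary="fill")


def spatial_decomposed_conv2d(x, k, num_parts):
    """Split x along the height axis into num_parts strips, emulate a halo
    exchange by slicing extra boundary rows from neighboring strips, convolve
    each strip locally, and stitch the owned output rows back together."""
    halo = k.shape[0] // 2                          # rows needed from each neighbor
    bounds = np.linspace(0, x.shape[0], num_parts + 1, dtype=int)
    strips = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        lo_h = max(lo - halo, 0)                    # extend strip by halo rows
        hi_h = min(hi + halo, x.shape[0])
        local = conv2d_same(x[lo_h:hi_h], k)        # local convolution on strip + halo
        strips.append(local[lo - lo_h : lo - lo_h + (hi - lo)])  # keep owned rows only
    return np.concatenate(strips, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.standard_normal((64, 64))
    kern = rng.standard_normal((3, 3))
    assert np.allclose(spatial_decomposed_conv2d(img, kern, 4),
                       conv2d_same(img, kern))
    print("Decomposed convolution matches the full convolution.")

In a distributed setting the slice copies become point-to-point messages between neighboring ranks, and the same partitioning can be combined with the usual sample (data-parallel) decomposition, which is the hybrid scheme the abstract refers to.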

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 210-220
Number of pages: 11
ISBN (Electronic): 9781728112466
DOI: 10.1109/IPDPS.2019.00031
State: Published - May 2019
Event: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019 - Rio de Janeiro, Brazil
Duration: May 20, 2019 - May 24, 2019

Publication series

Name: Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019

Conference

Conference: 33rd IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019
Country: Brazil
City: Rio de Janeiro
Period: 5/20/19 - 5/24/19

Keywords

  • Algorithms
  • Convolution
  • Deep learning
  • HPC
  • Performance modeling

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Hardware and Architecture
  • Information Systems and Management

Cite this

Dryden, N., Maruyama, N., Benson, T., Moon, T., Snir, M., & Van Essen, B. (2019). Improving strong-scaling of CNN training by exploiting finer-grained parallelism. In Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019 (pp. 210-220). [8820780] (Proceedings - 2019 IEEE 33rd International Parallel and Distributed Processing Symposium, IPDPS 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/IPDPS.2019.00031
