GradiVeQ: Vector quantization for bandwidth-efficient gradient aggregation in distributed CNN training

Mingchao Yu, Zhifeng Lin, Krishna Narra, Songze Li, Youjie Li, Nam Sung Kim, Alexander Schwing, Murali Annavaram, Salman Avestimehr

Research output: Contribution to journal › Conference article

Abstract

Data parallelism can boost the training speed of convolutional neural networks (CNN), but could suffer from significant communication costs caused by gradient aggregation. To alleviate this problem, several scalar quantization techniques have been developed to compress the gradients. But these techniques could perform poorly when used together with decentralized aggregation protocols like ring all-reduce (RAR), mainly due to their inability to directly aggregate compressed gradients. In this paper, we empirically demonstrate the strong linear correlations between CNN gradients, and propose a gradient vector quantization technique, named GradiVeQ, to exploit these correlations through principal component analysis (PCA) for substantial gradient dimension reduction. GradiVeQ enables direct aggregation of compressed gradients, hence allows us to build a distributed learning system that parallelizes GradiVeQ gradient compression and RAR communications. Extensive experiments on popular CNNs demonstrate that applying GradiVeQ slashes the wall-clock gradient aggregation time of the original RAR by more than 5X without noticeable accuracy loss, and reduces the end-to-end training time by almost 50%. The results also show that GradiVeQ is compatible with scalar quantization techniques such as QSGD (Quantized SGD), and achieves a much higher speed-up gain under the same compression ratio.
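To make the key property concrete, here is a minimal NumPy sketch (an illustration, not the authors' implementation) of why a linear compressor such as a PCA projection lets compressed gradients be summed directly, which is what allows GradiVeQ to be folded into ring all-reduce. The basis U, the dimensions, and the worker count below are illustrative assumptions.

    import numpy as np

    # Illustrative sketch: a linear compressor (e.g., a PCA basis) commutes with
    # summation, so workers can aggregate compressed gradients directly -- the
    # property the abstract attributes to GradiVeQ inside ring all-reduce.
    rng = np.random.default_rng(0)
    d, k, n_workers = 512, 32, 4                      # gradient dim, compressed dim, workers (assumed)
    U = np.linalg.qr(rng.standard_normal((d, k)))[0]  # stand-in orthonormal basis for the PCA directions

    grads = [rng.standard_normal(d) for _ in range(n_workers)]

    # Each worker compresses its own gradient; the compressed vectors are summed
    # during the all-reduce, with no decompression at intermediate hops.
    aggregated_compressed = sum(U.T @ g for g in grads)

    # Decompressing the aggregate matches compressing the true gradient sum,
    # because U.T @ (g1 + ... + gn) == U.T @ g1 + ... + U.T @ gn.
    true_sum = sum(grads)
    assert np.allclose(U @ aggregated_compressed, U @ (U.T @ true_sum))

The point here is only the linearity; how the paper fits the PCA basis, chooses the compressed dimension, and pipelines compression with RAR communication is described in the full text.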

Original language: English (US)
Pages (from-to): 5123-5133
Number of pages: 11
Journal: Advances in Neural Information Processing Systems
Volume: 2018-December
State: Published - Jan 1 2018
Event: 32nd Conference on Neural Information Processing Systems, NeurIPS 2018 - Montreal, Canada
Duration: Dec 2 2018 - Dec 8 2018

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Signal Processing

Cite this

Yu, M., Lin, Z., Narra, K., Li, S., Li, Y., Kim, N. S., ... Avestimehr, S. (2018). GradiVeQ: Vector quantization for bandwidth-efficient gradient aggregation in distributed CNN training. Advances in Neural Information Processing Systems, 2018-December, 5123-5133.
