A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks

Youjie Li, Jongse Park, Mohammad Alian, Yifan Yuan, Zheng Qu, Peitian Pan, Ren Wang, Alexander Schwing, Hadi Esmaeilzadeh, Nam Sung Kim

Research output: Chapter in Book/Report/Conference proceedingConference contribution


Training real-world Deep Neural Networks (DNNs) can take an eon (i.e., weeks or months) without leveraging distributed systems. Even distributed training takes inordinate time, of which a large fraction is spent in communicating weights and gradients over the network. State-of-The-Art distributed training algorithms use a hierarchy of worker-Aggregator nodes. The aggregators repeatedly receive gradient updates from their allocated group of the workers, and send back the updated weights. This paper sets out to reduce this significant communication cost by embedding data compression accelerators in the Network Interface Cards (NICs). To maximize the benefits of in-network acceleration, the proposed solution, named INCEPTIONN (In-Network Computing to Exchange and Process Training Information Of Neural Networks), uniquely combines hardware and algorithmic innovations by exploiting the following three observations. (1) Gradients are significantly more tolerant to precision loss than weights and as such lend themselves better to aggressive compression without the need for the complex mechanisms to avert any loss. (2) The existing training algorithms only communicate gradients in one leg of the communication, which reduces the opportunities for in-network acceleration of compression. (3) The aggregators can become a bottleneck with compression as they need to compress/decompress multiple streams from their allocated worker group. To this end, we first propose a lightweight and hardware-friendly lossy-compression algorithm for floating-point gradients, which exploits their unique value characteristics. This compression not only enables significantly reducing the gradient communication with practically no loss of accuracy, but also comes with low complexity for direct implementation as a hardware block in the NIC. To maximize the opportunities for compression and avoid the bottleneck at aggregators, we also propose an aggregator-free training algorithm that exchanges gradients in both legs of communication in the group, while the workers collectively perform the aggregation in a distributed manner. Without changing the mathematics of training, this algorithm leverages the associative property of the aggregation operator and enables our in-network accelerators to (1) apply compression for all communications, and (2) prevent the aggregator nodes from becoming bottlenecks. Our experiments demonstrate that INCEPTIONN reduces the communication time by 70.9~80.7% and offers 2.2~3.1x speedup over the conventional training system, while achieving the same level of accuracy.

Original languageEnglish (US)
Title of host publicationProceedings - 51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018
PublisherIEEE Computer Society
Number of pages14
ISBN (Electronic)9781538662403
StatePublished - Dec 12 2018
Event51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018 - Fukuoka, Japan
Duration: Oct 20 2018Oct 24 2018

Publication series

NameProceedings of the Annual International Symposium on Microarchitecture, MICRO
ISSN (Print)1072-4451


Other51st Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2018


  • Accelerators
  • DNN Training
  • Domain Specific Architectures
  • Reconfigurable Architectures and FPGA

ASJC Scopus subject areas

  • Hardware and Architecture


Dive into the research topics of 'A network-centric hardware/algorithm co-design to accelerate distributed training of deep neural networks'. Together they form a unique fingerprint.

Cite this