Accelerating distributed reinforcement learning with in-switch computing

Youjie Li, Iou Jen Liu, Yifan Yuan, Deming Chen, Alexander Gerhard Schwing, Jian Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Reinforcement learning (RL) has attracted much attention recently, as new and emerging AI-based applications demand the capability to react intelligently to changes in their environment. Unlike distributed deep neural network (DNN) training, distributed RL training has unique workload characteristics: it generates orders of magnitude more iterations with much smaller but more frequent gradient aggregations. More specifically, our study of typical RL algorithms shows that their distributed training is latency critical and that the network communication for gradient aggregation occupies up to 83.2% of the execution time of each training iteration. In this paper, we present iSwitch, an in-switch acceleration solution that moves gradient aggregation from server nodes into the network switches, thereby reducing the number of network hops required for gradient aggregation. This not only reduces the end-to-end network latency for synchronous training but also improves convergence through faster weight updates for asynchronous training. Building on the in-switch accelerator, we further reduce synchronization overhead by performing on-the-fly gradient aggregation at the granularity of network packets rather than entire gradient vectors. Moreover, we rethink the distributed RL training algorithms and propose a hierarchical aggregation mechanism to further increase the parallelism and scalability of distributed RL training at rack scale. We implement iSwitch on a real-world programmable switch, a NetFPGA board, and extend the control and data planes of the switch to support iSwitch without affecting its regular network functions. Compared with state-of-the-art distributed training approaches, iSwitch offers a system-level speedup of up to 3.66× for synchronous distributed training and 3.71× for asynchronous distributed training, while achieving better scalability.
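
The abstract describes two key ideas: aggregating gradients inside the switch (removing extra network hops) and doing so on the fly at packet granularity, so an aggregated packet can be emitted as soon as every worker's copy of that packet has arrived. The sketch below is not the authors' NetFPGA implementation; it is a minimal, purely illustrative Python model of packet-granularity aggregation, and the packet size, worker count, and class/function names are assumptions for illustration only.

```python
import numpy as np

# Toy model of in-switch, packet-granularity gradient aggregation as described
# in the abstract. All names and sizes below are illustrative assumptions, not
# details taken from the paper.

PACKET_FLOATS = 256  # assumed number of gradient values carried per packet


def split_into_packets(gradient: np.ndarray) -> list[np.ndarray]:
    """Chop a flat gradient vector into fixed-size packet payloads."""
    return [gradient[i:i + PACKET_FLOATS]
            for i in range(0, len(gradient), PACKET_FLOATS)]


class InSwitchAggregator:
    """Accumulates matching packets from all workers and emits the sum as soon
    as every worker's copy of a given packet index has arrived, i.e. on-the-fly
    aggregation at packet granularity rather than whole gradient vectors."""

    def __init__(self, num_workers: int):
        self.num_workers = num_workers
        self.partial = {}  # packet index -> (running sum, arrival count)

    def on_packet(self, pkt_idx: int, payload: np.ndarray):
        acc, seen = self.partial.get(pkt_idx, (np.zeros_like(payload), 0))
        acc = acc + payload
        seen += 1
        if seen == self.num_workers:      # all workers' copies have arrived
            del self.partial[pkt_idx]
            return acc                    # aggregated packet, ready to broadcast
        self.partial[pkt_idx] = (acc, seen)
        return None                       # still waiting for more workers


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    num_workers, grad_len = 4, 1000
    grads = [rng.standard_normal(grad_len) for _ in range(num_workers)]

    switch = InSwitchAggregator(num_workers)
    aggregated = np.empty(grad_len)
    for w in range(num_workers):
        for idx, payload in enumerate(split_into_packets(grads[w])):
            out = switch.on_packet(idx, payload)
            if out is not None:           # completed packet: write it back
                start = idx * PACKET_FLOATS
                aggregated[start:start + len(out)] = out

    assert np.allclose(aggregated, sum(grads))
    print("packet-level aggregation matches full-vector aggregation")
```

The final assertion checks that summing per packet yields the same result as aggregating whole gradient vectors at a parameter server, which is the equivalence that a packet-granularity scheme relies on.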

Original language: English (US)
Title of host publication: ISCA 2019 - Proceedings of the 2019 46th International Symposium on Computer Architecture
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 279-291
Number of pages: 13
ISBN (Electronic): 9781450366694
DOIs: https://doi.org/10.1145/3307650.3322259
State: Published - Jun 22 2019
Event: 46th International Symposium on Computer Architecture, ISCA 2019 - Phoenix, United States
Duration: Jun 22 2019 - Jun 26 2019

Publication series

Name: Proceedings - International Symposium on Computer Architecture
ISSN (Print): 1063-6897

Conference

Conference: 46th International Symposium on Computer Architecture, ISCA 2019
Country: United States
City: Phoenix
Period: 6/22/19 - 6/26/19

Keywords

  • Distributed machine learning
  • In-network computing
  • In-switch accelerator
  • Reinforcement learning

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Li, Y., Liu, I. J., Yuan, Y., Chen, D., Schwing, A. G., & Huang, J. (2019). Accelerating distributed reinforcement learning with in-switch computing. In ISCA 2019 - Proceedings of the 2019 46th International Symposium on Computer Architecture (pp. 279-291). (Proceedings - International Symposium on Computer Architecture). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/3307650.3322259
