TY - JOUR
T1 - Communication Compression for Decentralized Training
AU - Tang, Hanlin
AU - Gan, Shaoduo
AU - Zhang, Ce
AU - Zhang, Tong
AU - Liu, Ji
N1 - Hanlin Tang and Ji Liu are supported in part by NSF CCF-1718513, an IBM faculty award, and an NEC fellowship. CZ and the DS3Lab gratefully acknowledge the support from Mercedes-Benz Research & Development North America, Oracle Labs, Swisscom, Zurich Insurance, Chinese Scholarship Council, and the Department of Computer Science at ETH Zurich.
PY - 2018
Y1 - 2018
AB - Optimizing distributed learning systems is an art of balancing between computation and communication. There have been two lines of research that try to deal with slower networks: communication compression for low-bandwidth networks, and decentralization for high-latency networks. In this paper, we explore a natural question: can the combination of both techniques lead to a system that is robust to both bandwidth and latency? Although the system implication of such a combination is trivial, the underlying theoretical principle and algorithm design are challenging: unlike centralized algorithms, simply compressing the information exchanged within the decentralized network, even in an unbiased stochastic way, would accumulate error and fail to converge. In this paper, we develop a framework of compressed, decentralized training and propose two different strategies, which we call extrapolation compression and difference compression. We analyze both algorithms and prove that both converge at the rate of O(1/√(nT)), where n is the number of workers and T is the number of iterations, matching the convergence rate of full-precision, centralized training. We validate our algorithms and find that our proposed algorithm significantly outperforms the best of the merely decentralized and merely quantized algorithms for networks with both high latency and low bandwidth.
UR - http://www.scopus.com/inward/record.url?scp=85064848977&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85064848977&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:85064848977
SN - 1049-5258
VL - 2018-December
SP - 7652
EP - 7662
JO - Advances in Neural Information Processing Systems
JF - Advances in Neural Information Processing Systems
T2 - 32nd Conference on Neural Information Processing Systems, NeurIPS 2018
Y2 - 2 December 2018 through 8 December 2018
ER -