TY - GEN
T1 - HogWild++: A New Mechanism for Decentralized Asynchronous Stochastic Gradient Descent
T2 - 16th IEEE International Conference on Data Mining, ICDM 2016
AU - Zhang, Huan
AU - Hsieh, Cho-Jui
AU - Akella, Venkatesh
N1 - Publisher Copyright:
© 2016 IEEE.
PY - 2016/7/2
Y1 - 2016/7/2
AB - Stochastic Gradient Descent (SGD) is a popular technique for solving large-scale machine learning problems. To parallelize SGD on multi-core machines, asynchronous SGD (Hogwild) has been proposed, in which each core simultaneously updates a global model vector stored in shared memory, without explicit locks. We show that the scalability of Hogwild on modern multi-socket CPUs is severely limited, especially on NUMA (Non-Uniform Memory Access) systems, due to excessive cache-invalidation requests and false sharing. In this paper we propose a novel decentralized asynchronous SGD algorithm, called HogWild++, that overcomes these drawbacks and achieves almost linear speedup on multi-socket NUMA systems. The main idea in HogWild++ is to replace the global model vector with a set of local model vectors, each shared by a cluster (a set of cores), and to keep them synchronized through a decentralized token-based protocol that minimizes remote memory access conflicts and ensures convergence. We present the design and experimental evaluation of HogWild++ on a variety of datasets and show that it outperforms state-of-the-art parallel SGD implementations in terms of efficiency and scalability.
KW - Decentralized algorithm
KW - Non-uniform memory access (NUMA) architecture
KW - Stochastic gradient descent
UR - http://www.scopus.com/inward/record.url?scp=85014546892&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85014546892&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2016.169
DO - 10.1109/ICDM.2016.169
M3 - Conference contribution
AN - SCOPUS:85014546892
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 629
EP - 638
BT - Proceedings - 16th IEEE International Conference on Data Mining, ICDM 2016
A2 - Bonchi, Francesco
A2 - Domingo-Ferrer, Josep
A2 - Baeza-Yates, Ricardo
A2 - Zhou, Zhi-Hua
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 12 December 2016 through 15 December 2016
ER -
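
For intuition, here is a minimal single-threaded sketch of the token-based synchronization idea described in the abstract: each cluster of cores keeps its own local model vector, and a token circulates over the clusters in a ring, mixing the holder's local model with a snapshot carried by the token. Everything here (sgd_step, mix_weight, the placeholder gradient, the 50/50 average) is an illustrative assumption, not the paper's exact update rule; in the actual algorithm, asynchronous SGD threads update each cluster's local model concurrently.

import numpy as np

n_clusters = 4      # hypothetical: e.g. one cluster per NUMA socket
dim = 8
mix_weight = 0.5    # hypothetical mixing coefficient for synchronization

rng = np.random.default_rng(0)
local_models = [np.zeros(dim) for _ in range(n_clusters)]

def sgd_step(w):
    """Stand-in for the asynchronous SGD updates done by a cluster's cores."""
    grad = rng.normal(size=w.shape)   # placeholder gradient
    return w - 0.01 * grad

token_model = local_models[0].copy()  # model snapshot carried by the token
holder = 0
for _ in range(20):                   # simulate token circulation
    # cores of the holding cluster keep updating their local model
    local_models[holder] = sgd_step(local_models[holder])
    # on token arrival: mix the local model with the token's snapshot, so
    # updates propagate without touching other sockets' shared cache lines
    mixed = (1 - mix_weight) * local_models[holder] + mix_weight * token_model
    local_models[holder] = mixed
    token_model = mixed.copy()
    holder = (holder + 1) % n_clusters  # pass the token to the next cluster

Because only the token holder ever reads another cluster's state, remote memory accesses are serialized by the token rather than contended across all cores, which is the scalability argument the abstract makes against a single global model vector.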