TY - GEN
T1 - PIC: Partitioned Iterative Convergence for clusters
T2 - 2012 IEEE International Conference on Cluster Computing, CLUSTER 2012
AU - Farivar, Reza
AU - Raghunathan, Anand
AU - Chakradhar, Srimat
AU - Kharbanda, Harshit
AU - Campbell, Roy H.
PY - 2012
Y1 - 2012
AB - Iterative-convergence algorithms are frequently used in a variety of domains to build models from large data sets. Cluster implementations of these algorithms are commonly realized using parallel programming models such as MapReduce. However, these implementations suffer from significant performance bottlenecks, especially due to large volumes of network traffic resulting from intermediate data and model updates during the iterations. To address these challenges, we propose partitioned iterative convergence (PIC), a new approach to programming and executing iterative-convergence algorithms on frameworks like MapReduce. In PIC, we execute the iterative-convergence computation in two phases: the best-effort phase, which quickly produces a good initial model, and the top-off phase, which further refines this model to produce the final solution. The best-effort phase iteratively performs the following steps: (a) partition the input data and the model to create several smaller, model-building sub-problems, (b) independently solve these sub-problems using iterative-convergence computations, and (c) merge the solutions of the sub-problems to create the next version of the model. This partitioned, loosely coupled execution of the computation produces a model of good quality, while drastically reducing network traffic due to intermediate data and model updates. The top-off phase further refines this model by employing the original iterative-convergence computation on the entire (un-partitioned) problem until convergence. However, the number of iterations executed in the top-off phase is quite small, resulting in a significant overall improvement in performance. We have implemented a library for PIC on top of the Hadoop MapReduce framework, and evaluated it using five popular iterative-convergence algorithms (PageRank, K-Means clustering, neural network training, a linear equation solver, and image smoothing). Our evaluations on clusters ranging from 6 nodes to 256 nodes demonstrate a 2.5X-4X speedup compared to conventional implementations using Hadoop.
UR - http://www.scopus.com/inward/record.url?scp=84870720970&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84870720970&partnerID=8YFLogxK
U2 - 10.1109/CLUSTER.2012.84
DO - 10.1109/CLUSTER.2012.84
M3 - Conference contribution
AN - SCOPUS:84870720970
SN - 9780769548074
T3 - Proceedings - 2012 IEEE International Conference on Cluster Computing, CLUSTER 2012
SP - 391
EP - 401
BT - Proceedings - 2012 IEEE International Conference on Cluster Computing, CLUSTER 2012
PB - IEEE Computer Society
Y2 - 24 September 2012 through 28 September 2012
ER -