TY - GEN
T1 - Towards realizing the potential of malleable jobs
AU - Gupta, Abhishek
AU - Acun, Bilge
AU - Sarood, Osman
AU - Kale, Laxmikant V.
N1 - Publisher Copyright:
© 2014 IEEE.
PY - 2014
Y1 - 2014
N2 - Malleable jobs are those which can dynamically shrink or expand the number of processors on which they are executing at runtime in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time, compared to traditional jobs. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs a true shrink/expand, eliminating the need for any residual processes, requires little application programmer effort, and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16s, while expanding from 1k to 2k cores takes 40s. We also demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.
AB - Malleable jobs are those which can dynamically shrink or expand the number of processors on which they are executing at runtime in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time, compared to traditional jobs. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs a true shrink/expand, eliminating the need for any residual processes, requires little application programmer effort, and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16s, while expanding from 1k to 2k cores takes 40s. We also demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.
UR - http://www.scopus.com/inward/record.url?scp=84988226788&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84988226788&partnerID=8YFLogxK
U2 - 10.1109/HiPC.2014.7116905
DO - 10.1109/HiPC.2014.7116905
M3 - Conference contribution
AN - SCOPUS:84988226788
T3 - 2014 21st International Conference on High Performance Computing, HiPC 2014
BT - 2014 21st International Conference on High Performance Computing, HiPC 2014
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2014 21st International Conference on High Performance Computing, HiPC 2014
Y2 - 17 December 2014 through 20 December 2014
ER -