TY - GEN
T1 - A Batch System with Efficient Adaptive Scheduling for Malleable and Evolving Applications
AU - Prabhakaran, Suraj
AU - Neumann, Marcel
AU - Rinke, Sebastian
AU - Wolf, Felix
AU - Gupta, Abhishek
AU - Kale, Laxmikant V
PY - 2015/7/17
Y1 - 2015/7/17
N2 - The throughput of supercomputers depends not only on efficient job scheduling but also on the type of jobs that form the workload. Malleable jobs are most favourable for a cluster as they can dynamically adapt to a changing allocation of resources. The batch system can expand or shrink a running malleable job to improve system utilization, throughput, and response times. In the past, however, the rigid nature of commonly used programming models like MPI made writing malleable applications a daunting task, which is why it remained largely unrealized. This is now changing. To improve fault tolerance, load imbalance, and energy efficiency in emerging exascale systems, more adaptive programming paradigms such as Charm++ enter the scene. Although they offer better support for malleability, current batch systems still lack management facilities for malleable jobs and are therefore incapable of leveraging their potential. In this paper, we present an extension of the Torque/Maui batch system for malleability. We propose a novel malleable job scheduling strategy and show the first batch system capable of efficiently managing rigid, malleable, and evolving jobs together. We demonstrate that our strategy achieves consistently superior performance in comparison to every other state-of-the-art malleable job scheduling strategy under varying dynamics of the workload.
AB - The throughput of supercomputers depends not only on efficient job scheduling but also on the type of jobs that form the workload. Malleable jobs are most favourable for a cluster as they can dynamically adapt to a changing allocation of resources. The batch system can expand or shrink a running malleable job to improve system utilization, throughput, and response times. In the past, however, the rigid nature of commonly used programming models like MPI made writing malleable applications a daunting task, which is why it remained largely unrealized. This is now changing. To improve fault tolerance, load imbalance, and energy efficiency in emerging exascale systems, more adaptive programming paradigms such as Charm++ enter the scene. Although they offer better support for malleability, current batch systems still lack management facilities for malleable jobs and are therefore incapable of leveraging their potential. In this paper, we present an extension of the Torque/Maui batch system for malleability. We propose a novel malleable job scheduling strategy and show the first batch system capable of efficiently managing rigid, malleable, and evolving jobs together. We demonstrate that our strategy achieves consistently superior performance in comparison to every other state-of-the-art malleable job scheduling strategy under varying dynamics of the workload.
KW - adaptive resource management
KW - adaptive scheduling
KW - batch systems
KW - evolving jobs
KW - malleable jobs
UR - http://www.scopus.com/inward/record.url?scp=84971441032&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84971441032&partnerID=8YFLogxK
U2 - 10.1109/IPDPS.2015.34
DO - 10.1109/IPDPS.2015.34
M3 - Conference contribution
AN - SCOPUS:84971441032
T3 - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
SP - 429
EP - 438
BT - Proceedings - 2015 IEEE 29th International Parallel and Distributed Processing Symposium, IPDPS 2015
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 29th IEEE International Parallel and Distributed Processing Symposium, IPDPS 2015
Y2 - 25 May 2015 through 29 May 2015
ER -