TY - GEN
T1 - A multi-level scalable startup for parallel applications
AU - Gupta, Abhishek
AU - Zheng, Gengbin
AU - Kalé, Laxmikant V.
PY - 2011
Y1 - 2011
N2 - High-performance parallel machines with hundreds of thousands of processors and petascale performance are already in use, and even larger exaflop-scale computing systems, which may have hundreds of millions of cores, are planned. To run parallel applications on machines of such massive scale, one of the biggest challenges is the parallel startup process. This task involves two components: (1) parallel launching of appropriate processes on the given set of processors and (2) setting up communication channels to enable the processes to communicate with each other after process launching has completed. Most current startup mechanisms either use special-purpose daemons, which waste system resources, or use a startup manager, which becomes a scalability bottleneck. In this paper, we investigate the design and scalability of an SMP-aware, multi-level startup scheme with batching of remote shell sessions, which provides a complete solution for starting up a parallel application and facilitates its management during execution. It monitors process health and can be used to support recovery from failures and provide scalable interaction with the application. We demonstrate the performance and scalability of this scheme by applying it to start up Charm++ applications. In particular, starting up a Charm++ program on 16,384 cores of Ranger (at TACC) with Ethernet as the underlying communication layer takes only 25 seconds and attains a speedup of over 400% compared to MPICH2 startup (using hydra as process manager) and over 800% compared to Open MPI startup on Ranger.
AB - High-performance parallel machines with hundreds of thousands of processors and petascale performance are already in use, and even larger exaflop-scale computing systems, which may have hundreds of millions of cores, are planned. To run parallel applications on machines of such massive scale, one of the biggest challenges is the parallel startup process. This task involves two components: (1) parallel launching of appropriate processes on the given set of processors and (2) setting up communication channels to enable the processes to communicate with each other after process launching has completed. Most current startup mechanisms either use special-purpose daemons, which waste system resources, or use a startup manager, which becomes a scalability bottleneck. In this paper, we investigate the design and scalability of an SMP-aware, multi-level startup scheme with batching of remote shell sessions, which provides a complete solution for starting up a parallel application and facilitates its management during execution. It monitors process health and can be used to support recovery from failures and provide scalable interaction with the application. We demonstrate the performance and scalability of this scheme by applying it to start up Charm++ applications. In particular, starting up a Charm++ program on 16,384 cores of Ranger (at TACC) with Ethernet as the underlying communication layer takes only 25 seconds and attains a speedup of over 400% compared to MPICH2 startup (using hydra as process manager) and over 800% compared to Open MPI startup on Ranger.
UR - http://www.scopus.com/inward/record.url?scp=79959954323&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79959954323&partnerID=8YFLogxK
U2 - 10.1145/1988796.1988803
DO - 10.1145/1988796.1988803
M3 - Conference contribution
AN - SCOPUS:79959954323
SN - 9781450307611
T3 - Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2011
SP - 41
EP - 48
BT - Proceedings of the 1st International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2011
T2 - 1st International Workshop on Runtime and Operating Systems for Supercomputers, ROSS 2011
Y2 - 31 May 2011 through 31 May 2011
ER -