TY - GEN
T1 - Scheduling many-task workloads on supercomputers
T2 - Dealing with trailing tasks
AU - Armstrong, Timothy G.
AU - Zhang, Zhao
AU - Katz, Daniel S.
AU - Wilde, Michael
AU - Foster, Ian T.
N1 - Copyright:
Copyright 2020 Elsevier B.V., All rights reserved.
PY - 2010
Y1 - 2010
N2 - In order for many-task applications to be attractive candidates for running on high-end supercomputers, they must be able to benefit from the additional compute, I/O, and communication performance provided by high-end HPC hardware relative to clusters, grids, or clouds. Typically this means that the application should use the HPC resource in such a way that it can reduce time to solution beyond what is possible otherwise. Furthermore, it is necessary to make efficient use of the computational resources, achieving high levels of utilization. Satisfying these twin goals is not trivial, because while the parallelism in many-task computations can vary over time, on many large machines the allocation policy requires that worker CPUs be provisioned and also relinquished in large blocks rather than individually. This paper discusses the problem in detail, explaining and characterizing the trade-off between utilization and time to solution under the allocation policies of Blue Gene/P Intrepid at Argonne National Laboratory. We propose and test two strategies to improve this trade-off: scheduling tasks in order of longest to shortest (applicable only if task runtimes are predictable) and downsizing allocations when utilization drops below some threshold. We show that both strategies are effective under different conditions.
AB - In order for many-task applications to be attractive candidates for running on high-end supercomputers, they must be able to benefit from the additional compute, I/O, and communication performance provided by high-end HPC hardware relative to clusters, grids, or clouds. Typically this means that the application should use the HPC resource in such a way that it can reduce time to solution beyond what is possible otherwise. Furthermore, it is necessary to make efficient use of the computational resources, achieving high levels of utilization. Satisfying these twin goals is not trivial, because while the parallelism in many-task computations can vary over time, on many large machines the allocation policy requires that worker CPUs be provisioned and also relinquished in large blocks rather than individually. This paper discusses the problem in detail, explaining and characterizing the trade-off between utilization and time to solution under the allocation policies of Blue Gene/P Intrepid at Argonne National Laboratory. We propose and test two strategies to improve this trade-off: scheduling tasks in order of longest to shortest (applicable only if task runtimes are predictable) and downsizing allocations when utilization drops below some threshold. We show that both strategies are effective under different conditions.
KW - High-performance computing
KW - Many-task computing
KW - Scheduling
KW - Supercomputer systems
UR - http://www.scopus.com/inward/record.url?scp=79951828061&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=79951828061&partnerID=8YFLogxK
U2 - 10.1109/mtags.2010.5699433
DO - 10.1109/mtags.2010.5699433
M3 - Conference contribution
AN - SCOPUS:79951828061
SN - 9781424497041
T3 - 2010 3rd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS10
BT - 2010 3rd Workshop on Many-Task Computing on Grids and Supercomputers, MTAGS10
PB - IEEE Computer Society
ER -