TY - GEN
T1 - Cloud-Bursting and Autoscaling for Python-Native Scientific Workflows Using Ray
AU - Liu, Tingkai
AU - Ellis, Marquita
AU - Costa, Carlos
AU - Misale, Claudia
AU - Kokkila-Schumacher, Sara
AU - Jung, Jinwook
AU - Nam, Gi Joon
AU - Kindratenko, Volodymyr
N1 - Acknowledgement. This work is supported by the IBM-Illinois Discovery Accelerator Institute. This work utilizes resources supported by the National Science Foundation's Major Research Instrumentation program, grant #1725729, as well as by the University of Illinois Urbana-Champaign.
PY - 2023
Y1 - 2023
N2 - We have extended the Ray framework to enable automatic scaling of workloads on high-performance computing (HPC) clusters managed by Slurm® and bursting to a Cloud managed by Kubernetes®. Compared to existing HPC-Cloud convergence solutions, our framework offers several advantages: users can provide their own Cloud resources, the framework provides a Python-level abstraction that does not require users to interact with job submission systems, and a single Python-based parallel workload can run concurrently across an HPC cluster and a Cloud. Applications in Electronic Design Automation demonstrate the functionality of this solution in scaling a workload on an on-premises HPC system and automatically bursting to a public Cloud when the allocated HPC resources are exhausted. The paper focuses on describing the initial implementation, demonstrating the novel functionality of the proposed framework, and identifying practical considerations and limitations of the Cloud bursting mode. The code of our framework is open-source.
AB - We have extended the Ray framework to enable automatic scaling of workloads on high-performance computing (HPC) clusters managed by Slurm® and bursting to a Cloud managed by Kubernetes®. Compared to existing HPC-Cloud convergence solutions, our framework offers several advantages: users can provide their own Cloud resources, the framework provides a Python-level abstraction that does not require users to interact with job submission systems, and a single Python-based parallel workload can run concurrently across an HPC cluster and a Cloud. Applications in Electronic Design Automation demonstrate the functionality of this solution in scaling a workload on an on-premises HPC system and automatically bursting to a public Cloud when the allocated HPC resources are exhausted. The paper focuses on describing the initial implementation, demonstrating the novel functionality of the proposed framework, and identifying practical considerations and limitations of the Cloud bursting mode. The code of our framework is open-source.
KW - Cloud bursting
KW - HPC
KW - Kubernetes
UR - http://www.scopus.com/inward/record.url?scp=85171335334&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171335334&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-40843-4_16
DO - 10.1007/978-3-031-40843-4_16
M3 - Conference contribution
AN - SCOPUS:85171335334
SN - 9783031408427
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 207
EP - 220
BT - High Performance Computing - ISC High Performance 2023 International Workshops, Revised Selected Papers
A2 - Bienz, Amanda
A2 - Weiland, Michèle
A2 - Baboulin, Marc
A2 - Kruse, Carola
PB - Springer
T2 - 38th International Conference on High Performance Computing, ISC High Performance 2023
Y2 - 21 May 2023 through 25 May 2023
ER -