TY - GEN
T1 - Automated Data Management and Learning-Based Scheduling for Ray-Based Hybrid HPC-Cloud Systems
AU - Liu, Tingkai
AU - Tao, Huili
AU - Lu, Yicheng
AU - Zhu, Zhongbo
AU - Ellis, Marquita
AU - Kokkila-Schumacher, Sara
AU - Kindratenko, Volodymyr
N1 - We would like to thank Carlos Costa and Claudia Misale for their technical support and discussion. This work is funded by the IBM-Illinois Discovery Accelerator Institute. We are grateful to Amazon for providing Cloud resources on AWS.
PY - 2024
Y1 - 2024
N2 - HPC-Cloud hybrid systems are gaining popularity among scientists for their ability to manage sudden demand spikes, resulting in faster turnaround times for HPC workloads. However, deploying workloads on such systems currently requires complicated configurations, particularly for data migration across HPC clusters and Cloud. Additionally, existing schedulers lack support for workload scheduling on such hybrid systems. To address these issues, we have designed and implemented an HPC-Cloud bursting system based on Ray, an open-source distributed framework. Our system integrates automated data management with learning-based scheduling at the function level, using a dynamic label-based design. It automatically prefetches data files based on demand and detects data movement and execution patterns for future scheduling decisions. The developed framework is evaluated with two workloads: machine learning model training and image processing. We compare its performance against naive data fetching under various network speeds and storage locations. Results indicate the effectiveness of our system across all scenarios. The system is open-sourced and the source code and replication packages for reproducing experimental results are provided.
KW - Cloud bursting
KW - Data movement
KW - HPC
KW - Scheduling
UR - http://www.scopus.com/inward/record.url?scp=85202632091&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85202632091&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-69577-3_13
DO - 10.1007/978-3-031-69577-3_13
M3 - Conference contribution
AN - SCOPUS:85202632091
SN - 9783031695766
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 180
EP - 194
BT - Euro-Par 2024
A2 - Carretero, Jesus
A2 - Garcia-Blas, Javier
A2 - Shende, Sameer
A2 - Brandic, Ivona
A2 - Olcoz, Katzalin
A2 - Schreiber, Martin
PB - Springer
T2 - 30th International Conference on Parallel and Distributed Computing, Euro-Par 2024
Y2 - 26 August 2024 through 30 August 2024
ER -