Automated Data Management and Learning-Based Scheduling for Ray-Based Hybrid HPC-Cloud Systems

Tingkai Liu, Huili Tao, Yicheng Lu, Zhongbo Zhu, Marquita Ellis, Sara Kokkila-Schumacher, Volodymyr Kindratenko

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

HPC-Cloud hybrid systems are gaining popularity among scientists for their ability to manage sudden demand spikes, resulting in faster turnaround times for HPC workloads. However, deploying workloads on such systems currently requires complicated configurations, particularly for data migration across HPC clusters and Cloud. Additionally, existing schedulers lack support for workload scheduling on such hybrid systems. To address these issues, we have designed and implemented an HPC-Cloud bursting system based on Ray, an open-source distributed framework. Our system integrates automated data management with learning-based scheduling at the function level, using a dynamic label-based design. It automatically prefetches data files based on demand and detects data movement and execution patterns for future scheduling decisions. The developed framework is evaluated with two workloads: machine learning model training and image processing. We compare its performance against naive data fetching under various network speeds and storage locations. Results indicate the effectiveness of our system across all scenarios. The system is open-sourced and the source code and replication packages for reproducing experimental results are provided.
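
The interplay of label-based placement, on-demand prefetching, and learning from past executions described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation and does not use Ray. The names `Node`, `Scheduler`, `record`, `estimate`, and `place` are hypothetical, a single uniform link bandwidth is assumed, and "learning" is reduced here to a running mean of observed runtimes per function and node.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    labels: set  # hypothetical labels naming the data a node holds locally


@dataclass
class Scheduler:
    nodes: list
    bandwidth_mb_s: float  # assumption: one uniform link speed between sites
    # Running (count, mean) of observed runtimes per (function, node) pair.
    history: dict = field(default_factory=lambda: defaultdict(lambda: (0, 0.0)))

    def record(self, func, node_name, seconds):
        """Fold one observed runtime into the running mean."""
        n, mean = self.history[(func, node_name)]
        self.history[(func, node_name)] = (n + 1, mean + (seconds - mean) / (n + 1))

    def estimate(self, func, node, data_label, size_mb):
        """Estimated cost: transfer time if the data is remote, plus mean runtime."""
        transfer = 0.0 if data_label in node.labels else size_mb / self.bandwidth_mb_s
        _, mean_runtime = self.history[(func, node.name)]
        return transfer + mean_runtime

    def place(self, func, data_label, size_mb):
        """Pick the cheapest node; relabel it if the data has to be moved there."""
        best = min(self.nodes, key=lambda n: self.estimate(func, n, data_label, size_mb))
        if data_label not in best.labels:
            best.labels.add(data_label)  # stands in for prefetching the file
        return best.name


hpc = Node("hpc-0", {"data:images"})
cloud = Node("cloud-0", set())
sched = Scheduler(nodes=[hpc, cloud], bandwidth_mb_s=100.0)
sched.record("train", "hpc-0", 30.0)
sched.record("train", "cloud-0", 10.0)
chosen = sched.place("train", "data:images", 500.0)
```

In this toy setup the cloud node is chosen because the estimated transfer time (5 s) plus its mean runtime (10 s) is lower than the 30 s mean runtime on the HPC node, and the dynamic label update records that the data now also resides in the cloud. A real system would additionally need to account for varying network speeds and storage locations, which the abstract names as evaluation dimensions.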

Original language: English (US)
Title of host publication: Euro-Par 2024
Subtitle of host publication: Parallel Processing - 30th European Conference on Parallel and Distributed Processing, Proceedings
Editors: Jesus Carretero, Javier Garcia-Blas, Sameer Shende, Ivona Brandic, Katzalin Olcoz, Martin Schreiber
Publisher: Springer
Pages: 180-194
Number of pages: 15
ISBN (Print): 9783031695766
State: Published - 2024
Event: 30th International Conference on Parallel and Distributed Computing, Euro-Par 2024 - Madrid, Spain
Duration: Aug 26 2024 → Aug 30 2024

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 14801 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Conference

Conference: 30th International Conference on Parallel and Distributed Computing, Euro-Par 2024
Country/Territory: Spain
City: Madrid
Period: 8/26/24 → 8/30/24

Keywords

  • Cloud bursting
  • Data movement
  • HPC
  • Scheduling

ASJC Scopus subject areas

  • Theoretical Computer Science
  • General Computer Science
