Towards realizing the potential of malleable jobs

Abhishek Gupta, Bilge Acun, Osman Sarood, Laxmikant V. Kale

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Malleable jobs are those that can dynamically shrink or expand the number of processors on which they execute at runtime in response to an external command. Malleable jobs can significantly improve system utilization and reduce average response time compared to traditional jobs. To realize these benefits, three components are critical: an adaptive job scheduler, an adaptive resource manager, and an adaptive parallel runtime system. In this paper, we present a novel mechanism for enabling shrink/expand capability in the parallel runtime system using task migration and dynamic load balancing, checkpoint-restart, and Linux shared memory. Our technique performs true shrink/expand, eliminating the need for any residual processes, requires little application programmer effort, and is fast. Further, we establish a bidirectional communication channel between the resource manager and the parallel runtime, and present an asynchronous split-phase mechanism for executing adaptive scheduling decisions. Performance results using Charm++ on the Stampede supercomputer show the efficacy, scalability, and benefits of our approach. Shrinking from 2k to 1k cores takes 16s, while expanding from 1k to 2k cores takes 40s. We also demonstrate the utility of our runtime in traditional as well as emerging scenarios, e.g., proactive fault tolerance and clouds.
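As a rough illustration of the task-migration idea behind a shrink, the following minimal sketch reassigns migratable tasks from released processors onto the remaining ones with a greedy least-loaded heuristic. It is not the paper's mechanism and does not use the Charm++ API; the function name `shrink`, the `assignment` and `loads` dictionaries, and the heuristic itself are assumptions made purely for illustration.

```python
import heapq


def shrink(assignment, loads, new_count):
    """Hypothetical helper: move tasks off processors >= new_count.

    assignment: dict mapping task id -> processor index
    loads:      dict mapping task id -> estimated load
    new_count:  number of processors that remain after the shrink
    Returns a new task -> processor mapping using only 0..new_count-1.
    """
    # Accumulate the load already placed on each surviving processor
    # and collect the tasks that must migrate.
    proc_load = [0.0] * new_count
    evicted = []
    for task, proc in assignment.items():
        if proc < new_count:
            proc_load[proc] += loads[task]
        else:
            evicted.append(task)

    # Min-heap of (load, processor) so the least-loaded one is popped first.
    heap = [(load, proc) for proc, load in enumerate(proc_load)]
    heapq.heapify(heap)

    new_assignment = dict(assignment)
    # Place the heaviest evicted tasks first (greedy heuristic, an assumption).
    for task in sorted(evicted, key=lambda t: loads[t], reverse=True):
        load, proc = heapq.heappop(heap)
        new_assignment[task] = proc
        heapq.heappush(heap, (load + loads[task], proc))
    return new_assignment


# Usage: eight unit-load tasks spread over four processors, shrunk to two.
tasks = {t: t % 4 for t in range(8)}
unit_loads = {t: 1.0 for t in tasks}
after = shrink(tasks, unit_loads, 2)
```

A real runtime would additionally move task state between processes and terminate the released processes; the abstract attributes those steps to checkpoint-restart and Linux shared memory, which this sketch does not model.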

Original language: English (US)
Title of host publication: 2014 21st International Conference on High Performance Computing, HiPC 2014
Publisher: Institute of Electrical and Electronics Engineers Inc.
ISBN (Electronic): 9781479959761
State: Published - 2014
Event: 2014 21st International Conference on High Performance Computing, HiPC 2014 - Goa, India
Duration: Dec 17 2014 → Dec 20 2014

Publication series

Name: 2014 21st International Conference on High Performance Computing, HiPC 2014

Other

Other: 2014 21st International Conference on High Performance Computing, HiPC 2014
Country/Territory: India
City: Goa
Period: 12/17/14 → 12/20/14

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Hardware and Architecture
  • Software
