Improving goodput by coscheduling CPU and network capacity

Jim Basney, Miron Livny

Research output: Contribution to journal › Article › peer-review

Abstract

In a cluster computing environment, executable, checkpoint, and data files must be transferred between application submission and execution sites. As the memory footprint of cluster applications increases, saving and restoring the state of a computation in such an environment may require substantial network resources at both the start and the end of a CPU allocation. During the allocation, the application may also consume network bandwidth to periodically transfer a checkpoint back to the submission site or a checkpoint server and to access remote data files. Under most circumstances, the application cannot use the allocated CPU while these transfers are in progress. Furthermore, if the application is unable to transfer a checkpoint or migrate successfully at preemption time, work it has already accomplished is lost. The authors define goodput as the allocation time during which a remotely executing application uses the CPU to make forward progress. Because of this network activity, goodput can be significantly less than the total allocated time. The authors are currently engaged in an effort to develop coscheduling techniques for CPU and network resources that improve the goodput delivered by Condor pools. They report the techniques they have developed so far, how these were implemented in Condor, and their preliminary impact on the goodput of the authors' production Condor pool.
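The goodput definition above lends itself to a simple back-of-the-envelope calculation. The following Python sketch is not from the paper; the function and the example numbers are illustrative assumptions. It estimates goodput for one CPU allocation by subtracting the time spent on checkpoint and remote data transfers, which the abstract notes usually idle the allocated CPU.

```python
# Illustrative sketch (not from the paper): estimating goodput for a single
# CPU allocation, assuming goodput = allocation time minus time spent on
# file transfers that block the CPU.

def goodput_seconds(allocation_s, checkpoint_restore_s, periodic_checkpoint_s,
                    remote_data_s, final_transfer_s):
    """Return goodput (seconds of useful CPU time) for one allocation.

    All transfer times are assumed to block the CPU, since the abstract
    notes the application usually cannot compute while transfers run.
    """
    transfer_s = (checkpoint_restore_s + periodic_checkpoint_s
                  + remote_data_s + final_transfer_s)
    return max(0.0, allocation_s - transfer_s)

# Hypothetical example: a 1-hour allocation with 5 min restoring a checkpoint,
# 10 min of periodic checkpoints, 3 min of remote data access, and 5 min
# migrating the job off the machine at preemption time.
alloc = 3600.0
gp = goodput_seconds(alloc, 300.0, 600.0, 180.0, 300.0)
print(f"goodput: {gp:.0f} s of {alloc:.0f} s allocated "
      f"({100 * gp / alloc:.0f}% of the allocation)")
```

In this made-up scenario, roughly a third of the allocated time is consumed by network transfers, which is the kind of loss the paper's CPU/network coscheduling techniques aim to reduce.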

Original language: English (US)
Pages (from-to): 220-230
Number of pages: 11
Journal: International Journal of High Performance Computing Applications
Volume: 13
Issue number: 3
DOIs
State: Published - 1999
Externally published: Yes

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

