Automatic execution of single-GPU computations across multiple GPUs

Javier Cabezas, Lluís Vilanova, Isaac Gelado, Thomas B. Jablin, Nacho Navarro, Wen-Mei W Hwu

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to detect array access patterns in GPU kernels. The runtime uses this information to automatically choose the best computation and data distribution configuration. Through effective use of GPU caches, AMGE achieves good scalability in spite of the limited interconnect bandwidth between GPUs. Results show 1.95x and 3.73x execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.

Original languageEnglish (US)
Title of host publicationPACT 2014 - Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages467-468
Number of pages2
ISBN (Print)9781450328098
DOIs
StatePublished - Jan 1 2014
Event23rd International Conference on Parallel Architectures and Compilation Techniques, PACT 2014 - Edmonton, AB, Canada
Duration: Aug 24 2014Aug 27 2014

Publication series

NameParallel Architectures and Compilation Techniques - Conference Proceedings, PACT
ISSN (Print)1089-795X

Other

Other23rd International Conference on Parallel Architectures and Compilation Techniques, PACT 2014
CountryCanada
CityEdmonton, AB
Period8/24/148/27/14

Keywords

  • multi-gpu programming
  • numa

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

Fingerprint Dive into the research topics of 'Automatic execution of single-GPU computations across multiple GPUs'. Together they form a unique fingerprint.

Cite this