Automatic execution of single-GPU computations across multiple GPUs

Javier Cabezas, Lluís Vilanova, Isaac Gelado, Thomas B. Jablin, Nacho Navarro, Wen-Mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We present AMGE, a programming framework and runtime system to decompose data and GPU kernels and execute them on multiple GPUs concurrently. AMGE exploits the remote memory access capability of recent GPUs to guarantee data accessibility regardless of its physical location, thus allowing AMGE to safely decompose and distribute arrays across GPU memories. AMGE also includes a compiler analysis to detect array access patterns in GPU kernels. The runtime uses this information to automatically choose the best computation and data distribution configuration. Through effective use of GPU caches, AMGE achieves good scalability in spite of the limited interconnect bandwidth between GPUs. Results show 1.95x and 3.73x execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.
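The paper itself is only summarized by this abstract; purely as a loose illustration of the decomposition idea it describes (the function names and block-partitioning scheme below are assumptions for this sketch, not AMGE's actual API), an unmodified elementwise kernel can be block-distributed along an array's outermost dimension, run independently on each device's local slice, and recombined without changing the result:

```python
import numpy as np

def partition(a, num_gpus):
    # Hypothetical AMGE-style decomposition: block-distribute the array
    # along its outermost dimension, one slice per GPU memory. Elements
    # outside a slice would be reached via remote (peer) memory access.
    return np.array_split(a, num_gpus, axis=0)

def run_kernel_per_gpu(slices, kernel):
    # Each "GPU" runs the original, unmodified kernel on its local slice.
    return [kernel(s) for s in slices]

def gather(results):
    # Concatenating the per-GPU outputs reproduces the single-GPU result.
    return np.concatenate(results, axis=0)

a = np.arange(12, dtype=float).reshape(6, 2)
kernel = lambda x: 2.0 * x + 1.0   # stand-in for a simple elementwise GPU kernel
multi = gather(run_kernel_per_gpu(partition(a, 4), kernel))
single = kernel(a)
assert np.array_equal(multi, single)
```

For elementwise kernels such a split involves no remote traffic at all; for stencils or matrix products, accesses that cross slice boundaries are the remote accesses whose cost the paper reports being hidden by GPU caches.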

Original language: English (US)
Title of host publication: PACT 2014 - Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 467-468
Number of pages: 2
ISBN (Print): 9781450328098
DOI: 10.1145/2628071.2628109
State: Published - Jan 1, 2014
Event: 23rd International Conference on Parallel Architectures and Compilation Techniques, PACT 2014 - Edmonton, AB, Canada
Duration: Aug 24, 2014 - Aug 27, 2014

Publication series

Name: Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
ISSN (Print): 1089-795X

Other

Other: 23rd International Conference on Parallel Architectures and Compilation Techniques, PACT 2014
Country: Canada
City: Edmonton, AB
Period: 8/24/14 - 8/27/14

Keywords

  • multi-gpu programming
  • numa

ASJC Scopus subject areas

  • Software
  • Theoretical Computer Science
  • Hardware and Architecture

Cite this

Cabezas, J., Vilanova, L., Gelado, I., Jablin, T. B., Navarro, N., & Hwu, W-M. W. (2014). Automatic execution of single-GPU computations across multiple GPUs. In PACT 2014 - Proceedings of the 23rd International Conference on Parallel Architectures and Compilation Techniques (pp. 467-468). (Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1145/2628071.2628109
