Automatic parallelization of kernels in shared-memory multi-GPU nodes

Javier Cabezas, Lluís Vilanova, Isaac Gelado, Thomas B. Jablin, Nacho Navarro, Wen-Mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98× and 3.89× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.
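The abstract's key claim is that a row-wise decomposition of arrays across GPU memories keeps the fraction of remote accesses small enough (below 5%) for the GPU execution model to hide their cost. The sketch below is not AMGE's implementation; it is a minimal, hypothetical model of why that holds for a 5-point stencil: only thread blocks touching rows at a partition boundary ever reach into a neighboring GPU's memory, so the remote fraction shrinks as the array grows. All names and sizes are illustrative assumptions.

```python
# Illustrative sketch (not the AMGE runtime): partition a 2D array row-wise
# across GPUs and count the fraction of 5-point-stencil neighbor accesses
# that cross a partition boundary (i.e., would be remote accesses).

def partition_rows(n_rows, n_gpus):
    """Split row indices into contiguous chunks, one per GPU."""
    base, rem = divmod(n_rows, n_gpus)
    bounds, start = [], 0
    for g in range(n_gpus):
        end = start + base + (1 if g < rem else 0)
        bounds.append((start, end))
        start = end
    return bounds

def remote_fraction(n_rows, n_cols, n_gpus):
    """Fraction of vertical stencil accesses that land in a remote partition."""
    bounds = partition_rows(n_rows, n_gpus)
    owner = {}
    for g, (s, e) in enumerate(bounds):
        for r in range(s, e):
            owner[r] = g
    remote = total = 0
    for r in range(n_rows):
        for dr in (-1, 0, 1):  # up / self / down neighbors of a 5-point stencil
            rr = r + dr
            if 0 <= rr < n_rows:
                total += n_cols        # one access per column in this row pair
                if owner[rr] != owner[r]:
                    remote += n_cols   # neighbor row lives on another GPU
    return remote / total

# For a 4096x4096 array split across 4 GPUs, only the rows adjacent to the
# three partition boundaries generate remote accesses: well under 5%.
print(remote_fraction(4096, 4096, 4))
```

Under this model the remote fraction is roughly 2·(n_gpus − 1)/(3·n_rows), which is why large dense computations stay comfortably below the 5% threshold the abstract mentions; the paper's thread-block scheduling policy then spreads those boundary blocks over the whole kernel execution rather than clustering them.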

Original language: English (US)
Title of host publication: ICS 2015 - Proceedings of the 29th ACM International Conference on Supercomputing
Publisher: Association for Computing Machinery
Pages: 3-13
Number of pages: 11
ISBN (Electronic): 9781450335591
DOI: 10.1145/2751205.2751218
State: Published - Jun 8 2015
Event: 29th ACM International Conference on Supercomputing, ICS 2015 - Newport Beach, United States
Duration: Jun 8 2015 - Jun 11 2015

Publication series

Name: Proceedings of the International Conference on Supercomputing
Volume: 2015-June

Other

Other: 29th ACM International Conference on Supercomputing, ICS 2015
Country: United States
City: Newport Beach
Period: 6/8/15 - 6/11/15

Keywords

  • Multi-GPU programming
  • NUMA

ASJC Scopus subject areas

  • Computer Science(all)

Cite this

Cabezas, J., Vilanova, L., Gelado, I., Jablin, T. B., Navarro, N., & Hwu, W-M. W. (2015). Automatic parallelization of kernels in shared-memory multi-GPU nodes. In ICS 2015 - Proceedings of the 29th ACM International Conference on Supercomputing (pp. 3-13). (Proceedings of the International Conference on Supercomputing; Vol. 2015-June). Association for Computing Machinery. https://doi.org/10.1145/2751205.2751218
@inproceedings{13fd1ef47ae64e79b1ab12e18d3e65c3,
title = "Automatic parallelization of kernels in shared-memory multi-GPU nodes",
abstract = "In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5{\%}. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98× and 3.89× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.",
keywords = "Multi-GPU programming, NUMA",
author = "Javier Cabezas and Llu{\'i}s Vilanova and Isaac Gelado and Jablin, {Thomas B.} and Nacho Navarro and Hwu, {Wen-Mei W}",
year = "2015",
month = "6",
day = "8",
doi = "10.1145/2751205.2751218",
language = "English (US)",
series = "Proceedings of the International Conference on Supercomputing",
publisher = "Association for Computing Machinery",
pages = "3--13",
booktitle = "ICS 2015 - Proceedings of the 29th ACM International Conference on Supercomputing",

}

TY - GEN

T1 - Automatic parallelization of kernels in shared-memory multi-GPU nodes

AU - Cabezas, Javier

AU - Vilanova, Lluís

AU - Gelado, Isaac

AU - Jablin, Thomas B.

AU - Navarro, Nacho

AU - Hwu, Wen-Mei W

PY - 2015/6/8

Y1 - 2015/6/8

N2 - In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98× and 3.89× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.

AB - In this paper we present AMGE, a programming framework and runtime system that transparently decomposes GPU kernels and executes them on multiple GPUs in parallel. AMGE exploits the remote memory access capability in modern GPUs to ensure that data can be accessed regardless of its physical location, allowing our runtime to safely decompose and distribute arrays across GPU memories. It optionally performs a compiler analysis that detects array access patterns in GPU kernels. Using this information, the runtime can perform more efficient computation and data distribution configurations than previous works. The GPU execution model allows AMGE to hide the cost of remote accesses if they are kept below 5%. We demonstrate that a thread block scheduling policy that distributes remote accesses through the whole kernel execution further reduces their overhead. Results show 1.98× and 3.89× execution speedups for 2 and 4 GPUs for a wide range of dense computations compared to the original versions on a single GPU.

KW - Multi-GPU programming

KW - NUMA

UR - http://www.scopus.com/inward/record.url?scp=84957557808&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84957557808&partnerID=8YFLogxK

U2 - 10.1145/2751205.2751218

DO - 10.1145/2751205.2751218

M3 - Conference contribution

AN - SCOPUS:84957557808

T3 - Proceedings of the International Conference on Supercomputing

SP - 3

EP - 13

BT - ICS 2015 - Proceedings of the 29th ACM International Conference on Supercomputing

PB - Association for Computing Machinery

ER -