GPU-SM: Shared memory multi-GPU programming

Javier Cabezas, Marc Jordà, Isaac Gelado, Nacho Navarro, Wen-Mei W Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Discrete GPUs in modern multi-GPU systems can transparently access each other's memories through the PCIe interconnect. Future systems will improve this capability by including better GPU interconnects such as NVLink. However, remote memory access across GPUs has gone largely unnoticed among programmers, and multi-GPU systems are still programmed like distributed systems in which each GPU only accesses its own memory. This increases the complexity of the host code, as programmers need to explicitly communicate data across GPU memories. In this paper we present GPU-SM, a set of guidelines for programming multi-GPU systems like NUMA shared-memory systems with minimal performance overheads. Using GPU-SM, data structures can be decomposed across several GPU memories, and data that resides on a different GPU is accessed remotely through the PCIe interconnect. The programmability benefits of the shared-memory model on GPUs are shown using finite difference and image filtering applications. We also present a detailed performance analysis of the PCIe interconnect and the impact of remote accesses on kernel performance. While PCIe imposes long latency and offers limited bandwidth compared to local GPU memory, we show that the highly multithreaded GPU execution model can help reduce these costs. Evaluation of the finite difference and image filtering GPU-SM implementations shows close to linear speedups on a system with 4 GPUs, with much simpler code than the original implementations (e.g., a 40% SLOC reduction in the host code of finite difference).
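
The remote-access capability the abstract refers to is, in CUDA terms, peer-to-peer access under unified virtual addressing: a kernel running on one GPU can dereference a pointer whose backing memory was allocated on another GPU, with the loads serviced over PCIe. The snippet below is a minimal sketch of that mechanism, not the authors' GPU-SM code; the kernel, array names, and sizes are illustrative assumptions.

// Minimal sketch (not the authors' GPU-SM code): CUDA peer-to-peer access lets
// a kernel on GPU 0 read an array that physically resides in GPU 1's memory,
// with the accesses going over the PCIe interconnect.
#include <cstdio>
#include <cuda_runtime.h>

// Kernel launched on GPU 0: sums a local array and a "remote" array that was
// allocated on GPU 1 and is reachable through the peer mapping.
__global__ void add_remote(const float *local, const float *remote,
                           float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = local[i] + remote[i];   // remote[i] is fetched across PCIe
}

int main() {
    const int n = 1 << 20;

    int can_access = 0;
    cudaDeviceCanAccessPeer(&can_access, 0, 1);  // can GPU 0 map GPU 1's memory?
    if (!can_access) { std::printf("peer access not supported\n"); return 1; }

    cudaSetDevice(1);
    float *remote = nullptr;
    cudaMalloc(&remote, n * sizeof(float));      // physically resides on GPU 1
    cudaMemset(remote, 0, n * sizeof(float));

    cudaSetDevice(0);
    cudaDeviceEnablePeerAccess(1, 0);            // map GPU 1 memory into GPU 0's address space
    float *local = nullptr, *out = nullptr;
    cudaMalloc(&local, n * sizeof(float));
    cudaMalloc(&out, n * sizeof(float));
    cudaMemset(local, 0, n * sizeof(float));

    add_remote<<<(n + 255) / 256, 256>>>(local, remote, out, n);
    cudaDeviceSynchronize();

    cudaFree(local);
    cudaFree(out);
    cudaSetDevice(1);
    cudaFree(remote);
    return 0;
}

In a GPU-SM-style decomposition, each GPU would own one partition of a data structure allocated this way, and only accesses that fall outside the local partition (e.g., halo reads in the finite difference stencil) would cross PCIe.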

Original language: English (US)
Title of host publication: ACM International Conference Proceeding Series
Editors: Xiang Gong
Publisher: Association for Computing Machinery
Pages: 13-24
Number of pages: 12
ISBN (Electronic): 9781450334075
DOI: 10.1145/2716282.2716286
State: Published - Feb 7, 2015
Event: 8th Annual Workshop on General Purpose Processing using Graphics Processing Unit, GPGPU 2015 - San Francisco, United States
Duration: Feb 7, 2015 → …

Publication series

Name: ACM International Conference Proceeding Series
Volume: 2015-February

Other

Other: 8th Annual Workshop on General Purpose Processing using Graphics Processing Unit, GPGPU 2015
Country: United States
City: San Francisco
Period: 2/7/15 → …


Keywords

  • GPGPU
  • I/O interconnects
  • Shared memory machines

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Cabezas, J., Jordà, M., Gelado, I., Navarro, N., & Hwu, W-M. W. (2015). GPU-SM: Shared memory multi-GPU programming. In X. Gong (Ed.), ACM International Conference Proceeding Series (pp. 13-24). (ACM International Conference Proceeding Series; Vol. 2015-February). Association for Computing Machinery. https://doi.org/10.1145/2716282.2716286
