Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications

Javier Cabezas, Isaac Gelado, John E. Stone, Nacho Navarro, David B. Kirk, Wen-Mei W. Hwu

Research output: Contribution to journal › Article

Abstract

Heterogeneous parallel computing applications often process large data sets that require multiple GPUs to jointly meet their needs for physical memory capacity and compute throughput. However, the lack of high-level abstractions in previous heterogeneous parallel programming models forces programmers to resort to multiple code versions, complex data copy steps, and synchronization schemes when exchanging data between multiple GPU devices, which results in high software development cost, poor maintainability, and even poor performance. This paper describes the HPE runtime system, and the associated architecture support, which enables a simple, efficient programming interface for exchanging data between multiple GPUs through either interconnects or cross-node network interfaces. The runtime and architecture support presented in this paper can also be used to support other types of accelerators. We show that the simplified programming interface reduces programming complexity. The research presented in this paper started in 2009. It has been implemented and tested extensively in several generations of HPE runtime systems, and has been adopted into NVIDIA GPU hardware and drivers for CUDA 4.0 and beyond since 2011. The availability of real hardware that supports key HPE features gives rise to a rare opportunity to study the effectiveness of the hardware support by running important benchmarks on a real runtime and hardware. Experimental results show that in an exemplar heterogeneous system, peer DMA and double-buffering, pinned buffers, and software techniques can improve the inter-accelerator data communication bandwidth by 2×. They can also improve the execution speed by 1.6× for a 3D finite difference, 2.5× for a 1D FFT, and 1.6× for merge sort, all measured on real hardware.
The proposed architecture support enables the HPE runtime to transparently deploy these optimizations under simple, portable user code, allowing system designers to freely employ devices of different capabilities. We further argue that simple interfaces such as HPE are needed for most applications to benefit from advanced hardware features in practice.

Original language: English (US)
Article number: 6803940
Pages (from-to): 1405-1418
Number of pages: 14
Journal: IEEE Transactions on Parallel and Distributed Systems
Volume: 26
Issue number: 5
DOI: 10.1109/TPDS.2014.2316825
State: Published - May 1 2015


Keywords

  • Distributed architectures
  • data communications
  • hardware/software interfaces
  • heterogeneous (hybrid) systems

ASJC Scopus subject areas

  • Signal Processing
  • Hardware and Architecture
  • Computational Theory and Mathematics

Cite this

Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications. / Cabezas, Javier; Gelado, Isaac; Stone, John E.; Navarro, Nacho; Kirk, David B.; Hwu, Wen-Mei W.

In: IEEE Transactions on Parallel and Distributed Systems, Vol. 26, No. 5, 6803940, 01.05.2015, p. 1405-1418.

@article{ccda5f65b3974340be56904d5f880556,
title = "Runtime and Architecture Support for Efficient Data Exchange in Multi-Accelerator Applications",
keywords = "Distributed architectures, data communications, hardware/software interfaces, heterogeneous (hybrid) systems",
author = "Javier Cabezas and Isaac Gelado and Stone, {John E.} and Nacho Navarro and Kirk, {David B.} and Hwu, {Wen-Mei W}",
year = "2015",
month = "5",
day = "1",
doi = "10.1109/TPDS.2014.2316825",
language = "English (US)",
volume = "26",
pages = "1405--1418",
journal = "IEEE Transactions on Parallel and Distributed Systems",
issn = "1045-9219",
publisher = "IEEE Computer Society",
number = "5",

}
