Evaluating characteristics of CUDA communication primitives on high-bandwidth interconnects

Carl Pearson, Abdul Dakkak, Sarah Hashash, Cheng Li, I-Hsin Chung, Jinjun Xiong, Wen-Mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Data-intensive applications such as machine learning and analytics have created a demand for faster interconnects to avert the memory bandwidth wall and allow GPUs to be effectively leveraged for lower compute intensity tasks. This has resulted in wide adoption of heterogeneous systems with varying underlying interconnects, and has delegated the task of understanding and copying data to the system or application developer. No longer is a malloc followed by memcpy the only or dominant modality of data transfer; application developers are faced with additional options such as unified memory and zero-copy memory. Data transfer performance on these systems is now impacted by many factors including data transfer modality, system interconnect hardware details, CPU caching state, CPU power management state, driver policies, virtual memory paging efficiency, and data placement. This paper presents Comm|Scope, a set of microbenchmarks designed for system and application developers to understand memory transfer behavior across different data placement and exchange scenarios. Comm|Scope comprehensively measures the latency and bandwidth of CUDA data transfer primitives, and avoids common pitfalls in ad-hoc measurements by controlling CPU caches and clock frequencies, and by excluding, where possible, synchronization costs imposed by the measurement methodology. This paper also presents an evaluation of Comm|Scope on systems featuring the POWER and x86 CPU architectures and PCIe 3, NVLink 1, and NVLink 2 interconnects. These systems are chosen as representative configurations of current high-performance GPU platforms. Comm|Scope measurements can serve to update insights about the relative performance of data transfer methods on current systems. This work also reports insights for how high-level system design choices affect the performance of these data transfers, and how developers can optimize applications on these systems.
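
The abstract contrasts three CUDA data-transfer modalities: an explicit copy into cudaMalloc'd device memory, unified (managed) memory, and zero-copy (pinned, mapped) host memory. The sketch below is illustrative only and is not taken from the Comm|Scope benchmarks; the buffer size, kernel, and timing choices are assumptions for demonstration. It also shows event-based timing, which keeps host-side synchronization out of the measured interval, one of the measurement pitfalls the paper controls for.

    // Illustrative sketch (assumed example, not from the paper): the three CUDA
    // data-transfer modalities named in the abstract, with event-based timing.
    #include <cstdio>
    #include <cstring>
    #include <cuda_runtime.h>

    #define CHECK(call)                                                       \
      do {                                                                    \
        cudaError_t err_ = (call);                                            \
        if (err_ != cudaSuccess) {                                            \
          fprintf(stderr, "CUDA error %s at %s:%d\n",                         \
                  cudaGetErrorString(err_), __FILE__, __LINE__);              \
          return 1;                                                           \
        }                                                                     \
      } while (0)

    // Touch every byte so unified-memory and zero-copy accesses actually
    // cross the CPU-GPU interconnect.
    __global__ void touch(char *p, size_t n) {
      size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n) p[i] += 1;
    }

    int main() {
      const size_t n = 64ull << 20;  // 64 MiB transfer size (arbitrary choice)

      // 1. Explicit copy: pageable host buffer + cudaMemcpy into device memory,
      //    timed with CUDA events so host-side launch overhead is excluded.
      char *h = (char *)malloc(n);
      char *d = nullptr;
      CHECK(cudaMalloc((void **)&d, n));
      cudaEvent_t start, stop;
      CHECK(cudaEventCreate(&start));
      CHECK(cudaEventCreate(&stop));
      CHECK(cudaEventRecord(start));
      CHECK(cudaMemcpy(d, h, n, cudaMemcpyHostToDevice));
      CHECK(cudaEventRecord(stop));
      CHECK(cudaEventSynchronize(stop));
      float ms = 0.0f;
      CHECK(cudaEventElapsedTime(&ms, start, stop));
      printf("explicit copy: %.2f GB/s\n", n / ms / 1e6);

      // 2. Unified memory: one pointer valid on host and device; pages migrate
      //    on demand when the kernel touches them.
      char *u = nullptr;
      CHECK(cudaMallocManaged((void **)&u, n));
      memset(u, 0, n);                          // populate on the host first
      touch<<<(unsigned)((n + 255) / 256), 256>>>(u, n);
      CHECK(cudaDeviceSynchronize());

      // 3. Zero-copy: pinned, mapped host memory accessed directly by the GPU
      //    over the interconnect, with no bulk copy at all.
      char *zh = nullptr, *zd = nullptr;
      CHECK(cudaHostAlloc((void **)&zh, n, cudaHostAllocMapped));
      CHECK(cudaHostGetDevicePointer((void **)&zd, zh, 0));
      touch<<<(unsigned)((n + 255) / 256), 256>>>(zd, n);
      CHECK(cudaDeviceSynchronize());

      CHECK(cudaEventDestroy(start));
      CHECK(cudaEventDestroy(stop));
      free(h);
      CHECK(cudaFree(d));
      CHECK(cudaFree(u));
      CHECK(cudaFreeHost(zh));
      return 0;
    }

Which of these modalities is fastest depends on the factors the abstract lists (interconnect, data placement, caching and paging behavior), which is what the Comm|Scope microbenchmarks are designed to measure.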

Original language: English (US)
Title of host publication: ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering
Publisher: Association for Computing Machinery, Inc
Pages: 209-218
Number of pages: 10
ISBN (Electronic): 9781450362399
DOI: https://doi.org/10.1145/3297663.3310299
State: Published - Apr 4 2019
Event: 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019 - Mumbai, India
Duration: Apr 7 2019 → Apr 11 2019

Publication series

Name: ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering

Conference

Conference: 10th ACM/SPEC International Conference on Performance Engineering, ICPE 2019
Country: India
City: Mumbai
Period: 4/7/19 → 4/11/19

Keywords

  • Benchmarking
  • CUDA
  • GPU
  • NUMA
  • NVLink
  • POWER
  • X86

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
  • Computer Science Applications

Cite this

Pearson, C., Dakkak, A., Hashash, S., Li, C., Chung, I. H., Xiong, J., & Hwu, W-M. W. (2019). Evaluating characteristics of CUDA communication primitives on high-bandwidth interconnects. In ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering (pp. 209-218). (ICPE 2019 - Proceedings of the 2019 ACM/SPEC International Conference on Performance Engineering). Association for Computing Machinery, Inc. https://doi.org/10.1145/3297663.3310299
