Memory architecture and data locality

Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj

Research output: Chapter in Book/Report/Conference proceeding › Chapter


This chapter introduces the on-chip memory architecture of GPUs, the concept of memory-bound applications, and techniques for improving the performance of such applications. It uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses, and then introduces the tiling technique, in which barrier synchronization coordinates the timing of executing threads for improved locality and fewer global memory accesses. The tiling technique, however, involves additional complexity in the form of boundary checks; the chapter again uses matrix multiplication to illustrate the boundary checks needed for a tiled kernel to handle arbitrary matrix sizes. The chapter concludes with an overview of how the usage of shared memory and registers affects the number of thread blocks that can be accommodated in each streaming multiprocessor.
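The tiling approach described above can be sketched as a CUDA kernel. This is a minimal illustrative version, not the chapter's exact listing: the tile size and the names `Mds`/`Nds` are assumptions, and the boundary checks show how a tiled kernel can handle matrix widths that are not multiples of the tile width.

```cuda
#define TILE_WIDTH 16  // illustrative tile size

// Tiled square matrix multiplication P = M * N, with boundary
// checks so Width need not be a multiple of TILE_WIDTH.
__global__ void tiledMatMul(const float *M, const float *N,
                            float *P, int Width) {
    // Shared-memory tiles reduce redundant global memory accesses:
    // each element loaded once per tile is reused TILE_WIDTH times.
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0.0f;
    int numTiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int ph = 0; ph < numTiles; ++ph) {
        // Boundary checks: threads whose loads fall outside the
        // matrix contribute zeros instead of reading out of range.
        if (row < Width && ph * TILE_WIDTH + threadIdx.x < Width)
            Mds[threadIdx.y][threadIdx.x] =
                M[row * Width + ph * TILE_WIDTH + threadIdx.x];
        else
            Mds[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < Width && ph * TILE_WIDTH + threadIdx.y < Width)
            Nds[threadIdx.y][threadIdx.x] =
                N[(ph * TILE_WIDTH + threadIdx.y) * Width + col];
        else
            Nds[threadIdx.y][threadIdx.x] = 0.0f;

        // Barrier: all tile loads must complete before any use.
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];

        // Barrier: all uses must complete before the next loads.
        __syncthreads();
    }

    if (row < Width && col < Width)
        P[row * Width + col] = Pvalue;
}
```

Note that the per-block shared-memory footprint here (2 × TILE_WIDTH² × 4 bytes) is one of the resources that, as the chapter discusses, limits how many thread blocks a streaming multiprocessor can accommodate.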

Original language: English (US)
Title of host publication: Programming Massively Parallel Processors
Subtitle of host publication: A Hands-on Approach, Fourth Edition
Number of pages: 29
ISBN (Electronic): 9780323912310
ISBN (Print): 9780323984638
State: Published - Jan 1 2022


Keywords

  • memory bandwidth
  • cache
  • compute to global memory access ratio
  • lifetime
  • locality
  • memory latency
  • memory-bound
  • occupancy
  • on-chip memory
  • private memory
  • scope
  • shared memory
  • strip-mining
  • throughput
  • tiling

ASJC Scopus subject areas

  • General Computer Science

