Memory architecture and data locality

Wen-mei W. Hwu, David B. Kirk, Izzat El Hajj

Research output: Chapter in Book/Report/Conference proceeding › Chapter


This chapter introduces the on-chip memory architecture of GPUs, the concept of memory-bound applications, and techniques for improving the performance of such applications. It uses matrix multiplication to illustrate opportunities for reducing the number of global memory accesses, and then introduces the tiling technique, in which barrier synchronization coordinates the timing of executing threads for improved locality and fewer global memory accesses. The tiling technique, however, involves additional complexity in the form of boundary checks; the chapter again uses matrix multiplication to illustrate the boundary checks needed for a tiled kernel to handle arbitrary matrix sizes. The chapter concludes with an overview of how the usage of shared memory and registers affects the number of thread blocks that can be accommodated in each streaming multiprocessor.
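The tiling approach described above can be sketched as a CUDA kernel. This is a minimal illustrative version, not the chapter's exact listing: the tile size and the names `Mds`/`Nds` are assumptions, and the boundary checks show how a tiled kernel can handle matrix widths that are not multiples of the tile width.

```cuda
#define TILE_WIDTH 16  // illustrative tile size

// Tiled square matrix multiplication P = M * N, with boundary
// checks so Width need not be a multiple of TILE_WIDTH.
__global__ void tiledMatMul(const float *M, const float *N,
                            float *P, int Width) {
    // Shared-memory tiles reduce redundant global memory accesses:
    // each element loaded once per tile is reused TILE_WIDTH times.
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float Pvalue = 0.0f;
    int numTiles = (Width + TILE_WIDTH - 1) / TILE_WIDTH;
    for (int ph = 0; ph < numTiles; ++ph) {
        // Boundary checks: threads whose loads fall outside the
        // matrix contribute zeros instead of reading out of range.
        if (row < Width && ph * TILE_WIDTH + threadIdx.x < Width)
            Mds[threadIdx.y][threadIdx.x] =
                M[row * Width + ph * TILE_WIDTH + threadIdx.x];
        else
            Mds[threadIdx.y][threadIdx.x] = 0.0f;

        if (col < Width && ph * TILE_WIDTH + threadIdx.y < Width)
            Nds[threadIdx.y][threadIdx.x] =
                N[(ph * TILE_WIDTH + threadIdx.y) * Width + col];
        else
            Nds[threadIdx.y][threadIdx.x] = 0.0f;

        // Barrier: all tile loads must complete before any use.
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            Pvalue += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];

        // Barrier: all uses must complete before the next loads.
        __syncthreads();
    }

    if (row < Width && col < Width)
        P[row * Width + col] = Pvalue;
}
```

Note that the per-block shared-memory footprint here (2 × TILE_WIDTH² × 4 bytes) is one of the resources that, as the chapter discusses, limits how many thread blocks a streaming multiprocessor can accommodate.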

Original language: English (US)
Title of host publication: Programming Massively Parallel Processors
Subtitle of host publication: A Hands-on Approach, Fourth Edition
Number of pages: 29
ISBN (Electronic): 9780323912310
ISBN (Print): 9780323984638
State: Published - Jan 1 2022


Keywords

  • memory bandwidth
  • cache
  • compute to global memory access ratio
  • lifetime
  • locality
  • memory latency
  • memory-bound
  • occupancy
  • on-chip memory
  • private memory
  • scope
  • shared memory
  • strip-mining
  • throughput
  • tiling

ASJC Scopus subject areas

  • General Computer Science

