Performance analysis and tuning for general purpose graphics processing units (GPGPU)

Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Wen Mei Hwu, Choi Jee Choi

Research output: Chapter in Book/Report/Conference proceeding › Chapter

Abstract

General-purpose graphics processing units (GPGPUs) have emerged as an important class of shared-memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles used in previous shared-memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects hints for understanding the algorithmic aspects of GPGPUs. We also provide detailed performance analysis and guide optimizations from high-level algorithms down to low-level, instruction-level optimizations. As a case study, we use the fast multipole method (FMM) for n-body particle simulations. We also briefly survey the state of the art in GPU performance analysis tools and techniques.

Original language: English (US)
Title of host publication: Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
Pages: 1-94
Number of pages: 94
DOIs: https://doi.org/10.2200/S00451ED1V01Y201209CAC020
State: Published - Nov 21 2012

Publication series

Name: Synthesis Lectures on Computer Architecture
Volume: 20
ISSN (Print): 1935-3235
ISSN (Electronic): 1935-3243

Fingerprint

Tuning
Data storage equipment
Hardware
Cache memory
Memory architecture
Supercomputers
Parallel algorithms
Graphics processing unit
Bandwidth
Processing

Keywords

  • CUDA
  • GPGPU
  • GPU
  • performance analysis
  • performance modeling

ASJC Scopus subject areas

  • Hardware and Architecture

Cite this

Kim, H., Vuduc, R., Baghsorkhi, S., Hwu, W. M., & Jee Choi, C. (2012). Performance analysis and tuning for general purpose graphics processing units (GPGPU). In Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU) (pp. 1-94). (Synthesis Lectures on Computer Architecture; Vol. 20). https://doi.org/10.2200/S00451ED1V01Y201209CAC020

@inbook{a4d3b053d8c94f12a69190339b09e672,
title = "Performance analysis and tuning for general purpose graphics processing units (GPGPU)",
abstract = "General-purpose graphics processing units (GPGPUs) have emerged as an important class of shared-memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles used in previous shared-memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects hints for understanding the algorithmic aspects of GPGPUs. We also provide detailed performance analysis and guide optimizations from high-level algorithms down to low-level, instruction-level optimizations. As a case study, we use the fast multipole method (FMM) for n-body particle simulations. We also briefly survey the state of the art in GPU performance analysis tools and techniques.",
keywords = "CUDA, GPGPU, GPU, performance analysis, performance modeling",
author = "Hyesoon Kim and Richard Vuduc and Sara Baghsorkhi and Hwu, {Wen Mei} and {Jee Choi}, Choi",
year = "2012",
month = "11",
day = "21",
doi = "10.2200/S00451ED1V01Y201209CAC020",
language = "English (US)",
isbn = "9781608459544",
series = "Synthesis Lectures on Computer Architecture",
pages = "1--94",
booktitle = "Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)",

}

TY - CHAP

T1 - Performance analysis and tuning for general purpose graphics processing units (GPGPU)

AU - Kim, Hyesoon

AU - Vuduc, Richard

AU - Baghsorkhi, Sara

AU - Hwu, Wen Mei

AU - Jee Choi, Choi

PY - 2012/11/21

Y1 - 2012/11/21

N2 - General-purpose graphics processing units (GPGPUs) have emerged as an important class of shared-memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles used in previous shared-memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects hints for understanding the algorithmic aspects of GPGPUs. We also provide detailed performance analysis and guide optimizations from high-level algorithms down to low-level, instruction-level optimizations. As a case study, we use the fast multipole method (FMM) for n-body particle simulations. We also briefly survey the state of the art in GPU performance analysis tools and techniques.

KW - CUDA

KW - GPGPU

KW - GPU

KW - performance analysis

KW - performance modeling

UR - http://www.scopus.com/inward/record.url?scp=84870493423&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84870493423&partnerID=8YFLogxK

U2 - 10.2200/S00451ED1V01Y201209CAC020

DO - 10.2200/S00451ED1V01Y201209CAC020

M3 - Chapter

AN - SCOPUS:84870493423

SN - 9781608459544

T3 - Synthesis Lectures on Computer Architecture

SP - 1

EP - 94

BT - Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)

ER -