TY - CHAP
T1 - Performance analysis and tuning for general purpose graphics processing units (GPGPU)
AU - Kim, Hyesoon
AU - Vuduc, Richard
AU - Baghsorkhi, Sara
AU - Hwu, Wen-Mei
AU - Choi, Jee
PY - 2012/11/21
Y1 - 2012/11/21
N2 - General-purpose graphics processing units (GPGPUs) have emerged as an important class of shared-memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles used in previous shared-memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects insight into how algorithm characteristics map onto GPGPU platforms. We also provide detailed performance analysis and optimization guidance, from high-level algorithmic choices down to low-level instruction-level optimizations. As a case study, we use an n-body particle simulation based on the fast multipole method (FMM). We also briefly survey the state of the art in GPU performance analysis tools and techniques.
AB - General-purpose graphics processing units (GPGPUs) have emerged as an important class of shared-memory parallel processing architectures, with widespread deployment in every computer class from high-end supercomputers to embedded mobile platforms. Relative to more traditional multicore systems of today, GPGPUs have distinctly higher degrees of hardware multithreading (hundreds of hardware thread contexts vs. tens), a return to wide vector units (several tens vs. 1-10), memory architectures that deliver higher peak memory bandwidth (hundreds of gigabytes per second vs. tens), and smaller caches/scratchpad memories (less than 1 megabyte vs. 1-10 megabytes). In this book, we provide a high-level overview of current GPGPU architectures and programming models. We review the principles used in previous shared-memory parallel platforms, focusing on recent results in both the theory and practice of parallel algorithms, and suggest a connection to GPGPU platforms. We aim to give architects insight into how algorithm characteristics map onto GPGPU platforms. We also provide detailed performance analysis and optimization guidance, from high-level algorithmic choices down to low-level instruction-level optimizations. As a case study, we use an n-body particle simulation based on the fast multipole method (FMM). We also briefly survey the state of the art in GPU performance analysis tools and techniques.
KW - CUDA
KW - GPGPU
KW - GPU
KW - performance analysis
KW - performance modeling
UR - http://www.scopus.com/inward/record.url?scp=84870493423&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84870493423&partnerID=8YFLogxK
U2 - 10.2200/S00451ED1V01Y201209CAC020
DO - 10.2200/S00451ED1V01Y201209CAC020
M3 - Chapter
AN - SCOPUS:84870493423
SN - 9781608459544
T3 - Synthesis Lectures on Computer Architecture
SP - 1
EP - 94
BT - Performance Analysis and Tuning for General Purpose Graphics Processing Units (GPGPU)
ER -