Optimization and architecture effects on GPU computing workload performance

John A. Stratton, Nasser Anssari, Christopher Rodrigues, I-Jui Sung, Nady Obeid, Liwen Chang, Geng Daniel Liu, Wen-Mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

It is unquestionable that successive hardware generations have significantly improved GPU computing workload performance over the last several years. Moore's law and DRAM scaling have respectively increased single-chip peak instruction throughput by 3X and off-chip bandwidth by 2.2X from NVIDIA's GeForce 8800 GTX in November 2006 to its GeForce GTX 580 in November 2010. However, raw capability numbers typically underestimate the improvements in real application performance over the same time period, due to significant architectural feature improvements. To demonstrate the effects of architecture features and optimizations over time, we conducted experiments on a set of benchmarks from diverse application domains for multiple GPU architecture generations to understand how much performance has truly been improving for those workloads. First, we demonstrate that certain architectural features make a huge difference in the performance of unoptimized code, such as the inclusion of a general cache which can improve performance by 2-4x in some situations. Second, we describe what optimization patterns have been most essential and widely applicable for improving performance for GPU computing workloads across all architecture generations. Some important optimization patterns included data layout transformation, converting scatter accesses to gather accesses, GPU workload regularization, and granularity coarsening, each of which improved performance on some benchmark by over 20%, sometimes by a factor of more than 5x. While hardware improvements to baseline unoptimized code can reduce the speedup magnitude, these patterns remain important for even the most recent GPUs. Finally, we identify which added architectural features created significant new optimization opportunities, such as increased register file capacity or reduced bandwidth penalties for misaligned accesses, which increase performance by 2x or more in the optimized versions of relevant benchmarks.
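One of the optimization patterns the abstract names, converting scatter accesses to gather accesses, can be illustrated with a minimal CPU-side sketch. This example is not taken from the paper; the function names and data are illustrative. The point it demonstrates: in the scatter form, each input element writes to a data-dependent output location, so parallel GPU threads would contend for the same output and need atomic operations; in the gather form, each output element reads all inputs and accumulates its own contributions, so every thread owns a distinct write location.

```python
def scatter_sum(indices, values, n_out):
    """Scatter: input-driven writes. When parallelized over inputs,
    concurrent writes to the same bin would require atomics on a GPU."""
    out = [0.0] * n_out
    for idx, v in zip(indices, values):
        out[idx] += v
    return out

def gather_sum(indices, values, n_out):
    """Gather: output-driven reads. Each output element is produced by
    exactly one owner, so there are no write conflicts to serialize."""
    return [sum(v for idx, v in zip(indices, values) if idx == j)
            for j in range(n_out)]

idx = [0, 1, 0, 2]
val = [1.0, 2.0, 3.0, 4.0]
assert scatter_sum(idx, val, 3) == gather_sum(idx, val, 3) == [4.0, 2.0, 4.0]
```

The trade-off, consistent with the abstract's framing, is that the gather form does redundant reads in exchange for conflict-free writes; reads are cheaper to parallelize than contended writes on GPU memory systems.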

Original language: English (US)
Title of host publication: 2012 Innovative Parallel Computing, InPar 2012
DOI: 10.1109/InPar.2012.6339605
ISBN (Print): 9781467326322
State: Published - Dec 12 2012
Event: 2012 Innovative Parallel Computing, InPar 2012 - San Jose, CA, United States
Duration: May 13 2012 - May 14 2012

Publication series

Name: 2012 Innovative Parallel Computing, InPar 2012


Keywords

  • CUDA
  • GPU
  • Optimization

ASJC Scopus subject areas

  • Software

Cite this

Stratton, J. A., Anssari, N., Rodrigues, C., Sung, I. J., Obeid, N., Chang, L., ... Hwu, W-M. W. (2012). Optimization and architecture effects on GPU computing workload performance. In 2012 Innovative Parallel Computing, InPar 2012 [6339605] (2012 Innovative Parallel Computing, InPar 2012). https://doi.org/10.1109/InPar.2012.6339605

