Program optimization space pruning for a multithreaded GPU

Shane Ryoo, Christopher I. Rodrigues, Sam S. Stone, Sara S. Baghsorkhi, Sain-Zee Ueng, John A. Stratton, Wen-mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Program optimization for highly parallel systems has historically been considered an art, with experts doing much of the performance tuning by hand. With the introduction of inexpensive, single-chip, massively parallel platforms, more developers who lack the substantial experience and knowledge needed to maximize performance will be creating highly parallel applications for these platforms. This creates a need for more structured optimization methods that come with means to estimate their performance effects. Furthermore, these methods need to be understandable by most programmers. This paper shows the complexity involved in optimizing applications for one such system and presents a relatively simple methodology for reducing the workload involved in the optimization process. This work is based on one such highly parallel system, the GeForce 8800 GTX programmed with CUDA. Its flexible allocation of resources to threads allows it to extract performance from a range of applications with varying resource requirements, but it places new demands on developers who seek to maximize an application's performance. We show how optimizations interact with the architecture in complex ways, initially prompting an inspection of the entire configuration space to find the optimal configuration. Even for a seemingly simple application such as matrix multiplication, the optimal configuration can be unexpected. We then present metrics derived from static code that capture the first-order factors of performance. We demonstrate how these metrics can be used to prune many optimization configurations, down to those that lie on a Pareto-optimal curve. This reduces the optimization space by as much as 98% while still finding the optimal configuration for each of the studied applications.
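The pruning step described in the abstract can be sketched in a few lines of code: derive a small number of static metrics for every optimization configuration, keep only the configurations that are not dominated in all metrics (those on the Pareto-optimal frontier), and compile and time just that reduced set. The C++ sketch below illustrates the idea only; the metric names (efficiency, utilization), the configuration labels, and the numbers are illustrative assumptions, not values or code from the paper.

```cpp
// Pareto-pruning sketch: keep only optimization configurations that are
// not dominated in both static metrics. Metric names ("efficiency",
// "utilization") and all sample values are illustrative assumptions.
#include <iostream>
#include <string>
#include <vector>

struct Config {
    std::string name;      // e.g. tile size / unroll factor / prefetch choice
    double efficiency;     // static estimate of useful work per instruction
    double utilization;    // static estimate of how busy the GPU is kept
};

// A configuration is dominated if some other configuration is at least as
// good in both metrics and strictly better in at least one.
static bool dominated(const Config& c, const std::vector<Config>& all) {
    for (const Config& o : all) {
        bool geq = o.efficiency >= c.efficiency && o.utilization >= c.utilization;
        bool gt  = o.efficiency >  c.efficiency || o.utilization >  c.utilization;
        if (geq && gt) return true;
    }
    return false;
}

int main() {
    // Hypothetical matrix-multiplication configurations (tile size x unrolling).
    std::vector<Config> space = {
        {"8x8_unroll1",   0.55, 0.60},
        {"16x16_unroll1", 0.70, 0.85},
        {"16x16_unroll4", 0.82, 0.80},
        {"16x16_full",    0.90, 0.67},
        {"8x8_unroll4",   0.60, 0.55},
    };

    // Only the Pareto-optimal configurations need to be compiled and timed;
    // every other point in the optimization space is pruned without being run.
    for (const Config& c : space)
        if (!dominated(c, space))
            std::cout << "keep: " << c.name << "\n";
    return 0;
}
```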

Original language: English (US)
Title of host publication: Proceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization
Pages: 195-204
Number of pages: 10
DOIs: https://doi.org/10.1145/1356058.1356084
State: Published - May 19 2008

Publication series

Name: Proceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization

Keywords

  • GPGPU
  • Optimization
  • Parallel computing

ASJC Scopus subject areas

  • Software

Cite this

Ryoo, S., Rodrigues, C. I., Stone, S. S., Baghsorkhi, S. S., Ueng, S. Z., Stratton, J. A., & Hwu, W. M. W. (2008). Program optimization space pruning for a multithreaded GPU. In Proceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization (pp. 195-204). (Proceedings of the 2008 CGO - Sixth International Symposium on Code Generation and Optimization). https://doi.org/10.1145/1356058.1356084
