Implementing a GPU programming model on a non-GPU accelerator architecture

Stephen M. Kofsky, Daniel R. Johnson, John A. Stratton, Wen Mei W. Hwu, Sanjay J. Patel, Steven S. Lumetta

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel. We demonstrate performance improvements with these optimizations over naïve translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.

Original language: English (US)
Title of host publication: Computer Architecture - ISCA 2010 International Workshops, A4MMC, AMAS-BT, EAMA, WEED, WIOSCA, Revised Selected Papers
Pages: 40-51
Number of pages: 12
DOI: 10.1007/978-3-642-24322-6_5
State: Published - Mar 8 2012
Event: ACM IEEE International Symposium on Computer Architecture, ISCA 2010 - Saint-Malo, France
Duration: Jun 19 2010 - Jun 23 2010

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 6161 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: ACM IEEE International Symposium on Computer Architecture, ISCA 2010
Country: France
City: Saint-Malo
Period: 6/19/10 - 6/23/10


ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science(all)

Cite this

Kofsky, S. M., Johnson, D. R., Stratton, J. A., Hwu, W. M. W., Patel, S. J., & Lumetta, S. S. (2012). Implementing a GPU programming model on a non-GPU accelerator architecture. In Computer Architecture - ISCA 2010 International Workshops, A4MMC, AMAS-BT, EAMA, WEED, WIOSCA, Revised Selected Papers (pp. 40-51). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 6161 LNCS). https://doi.org/10.1007/978-3-642-24322-6_5

@inproceedings{b0800af1707b4264a08e1ccc2a3577f9,
title = "Implementing a GPU programming model on a non-GPU accelerator architecture",
abstract = "Parallel codes are written primarily for the purpose of performance. It is highly desirable that parallel codes be portable between parallel architectures without significant performance degradation or code rewrites. While performance portability and its limits have been studied thoroughly on single processor systems, this goal has been less extensively studied and is more difficult to achieve for parallel systems. Emerging single-chip parallel platforms are no exception; writing code that obtains good performance across GPUs and other many-core CMPs can be challenging. In this paper, we focus on CUDA codes, noting that programs must obey a number of constraints to achieve high performance on an NVIDIA GPU. Under such constraints, we develop optimizations that improve the performance of CUDA code on a MIMD accelerator architecture that we are developing called Rigel. We demonstrate performance improvements with these optimizations over na{\"i}ve translations, and final performance results comparable to those of codes that were hand-optimized for Rigel.",
author = "Kofsky, {Stephen M.} and Johnson, {Daniel R.} and Stratton, {John A.} and Hwu, {Wen Mei W.} and Patel, {Sanjay J.} and Lumetta, {Steven S.}",
year = "2012",
month = "3",
day = "8",
doi = "10.1007/978-3-642-24322-6_5",
language = "English (US)",
isbn = "9783642243219",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "40--51",
booktitle = "Computer Architecture - ISCA 2010 International Workshops, A4MMC, AMAS-BT, EAMA, WEED, WIOSCA, Revised Selected Papers",

}
