MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs

John A. Stratton, Sam S. Stone, Wen-Mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

CUDA is a data-parallel programming model that supports several key abstractions - thread blocks, hierarchical memory, and barrier synchronization - for writing applications. This model has proven effective in programming GPUs. In this paper we describe a framework called MCUDA, which allows CUDA programs to be executed efficiently on shared-memory multi-core CPUs. Our framework consists of a set of source-level compiler transformations and a runtime system for parallel execution. While preserving program semantics, the compiler transforms threaded SPMD functions into explicit loops, performs loop fission to eliminate barrier synchronizations, and converts scalar references to thread-local data into replicated vector references. We describe an implementation of this framework and demonstrate performance approaching that achievable from manually parallelized and optimized C code. With these results, we argue that CUDA can be an effective data-parallel programming model for more than just GPU architectures.
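
As a rough illustration of the three transformations named in the abstract, the sketch below hand-translates a small CUDA kernel into plain C. It is not taken from the paper and is not MCUDA's actual output: the kernel, the function names (`neighbor_sum_block`, `neighbor_sum_grid`), and the block size `BLOCK_DIM` are all hypothetical, and the outer block loop runs serially here where a runtime system would presumably distribute blocks across CPU cores.

```c
#include <stddef.h>

/* Hypothetical block size, chosen only for this sketch. */
#define BLOCK_DIM 4

/*
 * Assumed CUDA source (illustrative, not from the paper):
 *
 *   __global__ void neighbor_sum(const float *in, float *out) {
 *       __shared__ float tile[BLOCK_DIM];
 *       int t = threadIdx.x;
 *       float mine = in[blockIdx.x * BLOCK_DIM + t] * 2.0f;
 *       tile[t] = mine;
 *       __syncthreads();
 *       out[blockIdx.x * BLOCK_DIM + t] = mine + tile[(t + 1) % BLOCK_DIM];
 *   }
 */

/* One CUDA thread block, executed by a single CPU thread. */
static void neighbor_sum_block(const float *in, float *out, int block_idx)
{
    float tile[BLOCK_DIM]; /* was __shared__: one copy per block */
    float mine[BLOCK_DIM]; /* was a thread-local scalar that is live across
                              the barrier, so it is replicated per thread */
    int base = block_idx * BLOCK_DIM;

    /* Thread loop 1: the kernel body before __syncthreads(). */
    for (int t = 0; t < BLOCK_DIM; ++t) {
        mine[t] = in[base + t] * 2.0f;
        tile[t] = mine[t];
    }

    /* The barrier disappears: loop fission guarantees that every logical
       thread has finished the first region before any enters the second. */

    /* Thread loop 2: the kernel body after __syncthreads(). */
    for (int t = 0; t < BLOCK_DIM; ++t) {
        out[base + t] = mine[t] + tile[(t + 1) % BLOCK_DIM];
    }
}

/* Grid loop, serial in this sketch; blocks are independent, so they could
   be handed out to worker threads instead. */
static void neighbor_sum_grid(const float *in, float *out, int num_blocks)
{
    for (int b = 0; b < num_blocks; ++b) {
        neighbor_sum_block(in, out, b);
    }
}
```

The variable `mine` has to become an array only because its value is produced in one loop and consumed in the other; a scalar used entirely within one region could stay a scalar inside that loop.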

Original language: English (US)
Title of host publication: Languages and Compilers for Parallel Computing - 21st International Workshop, LCPC 2008, Revised Selected Papers
Pages: 16-30
Number of pages: 15
DOI: https://doi.org/10.1007/978-3-540-89740-8_2
State: Published - Dec 1 2008
Event: 21st International Workshop on Languages and Compilers for Parallel Computing, LCPC 2008 - Edmonton, AB, Canada
Duration: Jul 31 2008 - Aug 2 2008

Publication series

Name: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 5335 LNCS
ISSN (Print): 0302-9743
ISSN (Electronic): 1611-3349

Other

Other: 21st International Workshop on Languages and Compilers for Parallel Computing, LCPC 2008
Country: Canada
City: Edmonton, AB
Period: 7/31/08 - 8/2/08

Fingerprint

Efficient Implementation
Program processors
Parallel programming
kernel
Compiler
Thread
Programming Model
Synchronization
Data storage equipment
Runtime Systems
Shared Memory
Convert
Eliminate
Programming
Semantics
Scalar
Transform
Demonstrate
Framework

ASJC Scopus subject areas

  • Theoretical Computer Science
  • Computer Science (all)

Cite this

Stratton, J. A., Stone, S. S., & Hwu, W-M. W. (2008). MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs. In Languages and Compilers for Parallel Computing - 21st International Workshop, LCPC 2008, Revised Selected Papers (pp. 16-30). (Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics); Vol. 5335 LNCS). https://doi.org/10.1007/978-3-540-89740-8_2

@inproceedings{0200d39c73f3435ba8f552c9365e2fca,
title = "MCUDA: An efficient implementation of CUDA kernels for multi-core CPUs",
author = "Stratton, {John A.} and Stone, {Sam S.} and Hwu, {Wen-Mei W}",
year = "2008",
month = "12",
day = "1",
doi = "10.1007/978-3-540-89740-8_2",
language = "English (US)",
isbn = "3540897399",
series = "Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)",
pages = "16--30",
booktitle = "Languages and Compilers for Parallel Computing - 21st International Workshop, LCPC 2008, Revised Selected Papers",

}

TY - GEN

T1 - MCUDA

T2 - An efficient implementation of CUDA kernels for multi-core CPUs

AU - Stratton, John A.

AU - Stone, Sam S.

AU - Hwu, Wen-Mei W

PY - 2008/12/1

Y1 - 2008/12/1

UR - http://www.scopus.com/inward/record.url?scp=58449109179&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=58449109179&partnerID=8YFLogxK

U2 - 10.1007/978-3-540-89740-8_2

DO - 10.1007/978-3-540-89740-8_2

M3 - Conference contribution

AN - SCOPUS:58449109179

SN - 3540897399

SN - 9783540897392

T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

SP - 16

EP - 30

BT - Languages and Compilers for Parallel Computing - 21st International Workshop, LCPC 2008, Revised Selected Papers

ER -