FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs

Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, Wen Mei W. Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
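
The abstract describes a source-to-source step that turns SPMD CUDA thread blocks into C code for a high-level synthesis tool. The sketch below is only an illustration of that general idea, not the actual output of FCUDA and not AutoPilot-specific code: the function names `vec_add_block` and `vec_add_grid`, the vector-add kernel, and the one-dimensional indexing are all assumptions chosen for brevity. It shows how the implicit threads of one block can be written as an explicit loop, and how independent blocks form an outer loop that a synthesis tool could, in principle, map to parallel cores.

```c
#include <assert.h>

/*
 * Reference CUDA kernel (shown as a comment only, not compiled here):
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

/* Hypothetical helper: one thread block expressed as plain C.
 * The implicit CUDA threads become an explicit loop over tid. */
static void vec_add_block(const float *a, const float *b, float *c, int n,
                          int block_idx, int block_dim)
{
    for (int tid = 0; tid < block_dim; ++tid) {
        int i = block_idx * block_dim + tid;  /* same index math as the kernel */
        if (i < n)
            c[i] = a[i] + b[i];
    }
}

/* Hypothetical helper: the grid as an outer loop over blocks.
 * Blocks do not depend on each other, so the iterations are
 * candidates for parallel hardware instances. */
void vec_add_grid(const float *a, const float *b, float *c, int n,
                  int block_dim)
{
    assert(block_dim > 0);
    int grid_dim = (n + block_dim - 1) / block_dim;  /* ceiling division */
    for (int bid = 0; bid < grid_dim; ++bid)
        vec_add_block(a, b, c, n, bid, block_dim);
}
```

Calling `vec_add_grid(a, b, c, n, 2)` on five-element arrays runs three blocks of two threads each, with the bounds check skipping the one out-of-range index in the last block. A real flow would also have to handle shared memory, synchronization barriers, and off-chip data transfers, none of which this sketch attempts.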

Original language: English (US)
Title of host publication: 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009
Pages: 35-42
Number of pages: 8
DOIs: https://doi.org/10.1109/SASP.2009.5226333
State: Published - Nov 11 2009
Event: 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009 - San Francisco, CA, United States
Duration: Jul 27 2009 to Jul 28 2009

Publication series

Name: 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009

Other

Other: 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009
Country: United States
City: San Francisco, CA
Period: 7/27/09 to 7/28/09

Fingerprint

Field programmable gate arrays (FPGA)
Application programming interfaces (API)
Parallel processing systems
Thermal effects
Particle accelerators
Clocks
Industry
Energy dissipation
Imaging techniques
Processing
Graphics processing unit

ASJC Scopus subject areas

  • Computer Science Applications
  • Hardware and Architecture

Cite this

Papakonstantinou, A., Gururaj, K., Stratton, J. A., Chen, D., Cong, J., & Hwu, W. M. W. (2009). FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009 (pp. 35-42). [5226333] (2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009). https://doi.org/10.1109/SASP.2009.5226333

FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. / Papakonstantinou, Alexandros; Gururaj, Karthik; Stratton, John A.; Chen, Deming; Cong, Jason; Hwu, Wen Mei W.

2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009. 2009. p. 35-42 5226333 (2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009).

Papakonstantinou, A, Gururaj, K, Stratton, JA, Chen, D, Cong, J & Hwu, WMW 2009, FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. in 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009., 5226333, 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009, pp. 35-42, 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009, San Francisco, CA, United States, 7/27/09. https://doi.org/10.1109/SASP.2009.5226333
Papakonstantinou A, Gururaj K, Stratton JA, Chen D, Cong J, Hwu WMW. FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. In 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009. 2009. p. 35-42. 5226333. (2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009). https://doi.org/10.1109/SASP.2009.5226333
Papakonstantinou, Alexandros; Gururaj, Karthik; Stratton, John A.; Chen, Deming; Cong, Jason; Hwu, Wen Mei W. / FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs. 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009. 2009. pp. 35-42 (2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009).
@inproceedings{88ffff9daeb04c7592efa7e1c19e0291,
title = "FCUDA: Enabling efficient compilation of CUDA kernels onto FPGAs",
abstract = "As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.",
author = "Alexandros Papakonstantinou and Karthik Gururaj and Stratton, {John A.} and Deming Chen and Jason Cong and Hwu, {Wen Mei W.}",
year = "2009",
month = "11",
day = "11",
doi = "10.1109/SASP.2009.5226333",
language = "English (US)",
isbn = "9781424449385",
series = "2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009",
pages = "35--42",
booktitle = "2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009",

}

TY - GEN

T1 - FCUDA

T2 - Enabling efficient compilation of CUDA kernels onto FPGAs

AU - Papakonstantinou, Alexandros

AU - Gururaj, Karthik

AU - Stratton, John A.

AU - Chen, Deming

AU - Cong, Jason

AU - Hwu, Wen Mei W.

PY - 2009/11/11

Y1 - 2009/11/11

AB - As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moore's law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute-intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the application's fine- and coarse-grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

UR - http://www.scopus.com/inward/record.url?scp=70350752429&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=70350752429&partnerID=8YFLogxK

U2 - 10.1109/SASP.2009.5226333

DO - 10.1109/SASP.2009.5226333

M3 - Conference contribution

AN - SCOPUS:70350752429

SN - 9781424449385

T3 - 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009

SP - 35

EP - 42

BT - 2009 IEEE 7th Symposium on Application Specific Processors, SASP 2009

ER -