Efficient compilation of CUDA kernels for high-performance computing on FPGAs

Alexandros Papakonstantinou, Karthik Gururaj, John A. Stratton, Deming Chen, Jason Cong, Wen-Mei W. Hwu

Research output: Contribution to journal › Article

Abstract

The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
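As a rough illustration of the kind of SIMT-to-C rewriting the abstract describes, the sketch below serializes a trivial CUDA vector-add kernel into plain C. It is not FCUDA's actual output: the names `dim1`, `vec_add_block`, and `vec_add_grid` are hypothetical, and no AutoPilot pragmas or annotations are shown. The only idea it demonstrates is that the implicit CUDA threads of a block can be expressed as an explicit loop, and that each thread block then becomes an independent task that a high-level synthesis tool could, in principle, replicate as parallel cores.

```c
#include <assert.h>

/* Hypothetical stand-in for CUDA's built-in index types (1-D only). */
typedef struct { int x; } dim1;

/* Original CUDA kernel, shown for reference only:
 *
 *   __global__ void vec_add(const float *a, const float *b, float *c, int n) {
 *       int i = blockIdx.x * blockDim.x + threadIdx.x;
 *       if (i < n) c[i] = a[i] + b[i];
 *   }
 */

/* One thread block expressed as a C task: the implicit SIMT threads
 * become an explicit loop over the thread index. */
static void vec_add_block(const float *a, const float *b, float *c, int n,
                          dim1 blockIdx, dim1 blockDim)
{
    for (int tx = 0; tx < blockDim.x; ++tx) {
        int i = blockIdx.x * blockDim.x + tx;
        if (i < n) {            /* same bounds guard as the CUDA kernel */
            c[i] = a[i] + b[i];
        }
    }
}

/* The grid expressed as a loop over block tasks. The blocks do not depend
 * on each other, so they are candidates for task-level parallelism; this
 * sketch simply runs them one after another on the host. */
void vec_add_grid(const float *a, const float *b, float *c, int n,
                  int grid_dim, int block_dim)
{
    assert(grid_dim > 0 && block_dim > 0);
    dim1 bd = { block_dim };
    for (int bx = 0; bx < grid_dim; ++bx) {
        dim1 bi = { bx };
        vec_add_block(a, b, c, n, bi, bd);
    }
}
```

Calling `vec_add_grid(a, b, c, n, grid_dim, block_dim)` with `grid_dim * block_dim >= n` covers every element, mirroring a CUDA launch of `vec_add<<<grid_dim, block_dim>>>`.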

Original language: English (US)
Article number: 25
Journal: Transactions on Embedded Computing Systems
Volume: 13
Issue number: 2
DOI: 10.1145/2514641.2514652
State: Published - Oct 21 2013

Fingerprint

Field programmable gate arrays (FPGA)
Particle accelerators
Imaging techniques
Processing

Keywords

  • FPGA
  • Heterogeneous compute systems
  • High-level synthesis
  • High-performance computing
  • Parallel programming model
  • Source-to-source compiler

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture

Cite this

Efficient compilation of CUDA kernels for high-performance computing on FPGAs. / Papakonstantinou, Alexandros; Gururaj, Karthik; Stratton, John A.; Chen, Deming; Cong, Jason; Hwu, Wen-Mei W.

In: Transactions on Embedded Computing Systems, Vol. 13, No. 2, 25, 21.10.2013.

Papakonstantinou, Alexandros ; Gururaj, Karthik ; Stratton, John A. ; Chen, Deming ; Cong, Jason ; Hwu, Wen-Mei W. / Efficient compilation of CUDA kernels for high-performance computing on FPGAs. In: Transactions on Embedded Computing Systems. 2013 ; Vol. 13, No. 2.
@article{ff20d84cce5e42028131f45dc3a4a8f2,
title = "Efficient compilation of CUDA kernels for high-performance computing on FPGAs",
abstract = "The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.",
keywords = "FPGA, Heterogeneous compute systems, High-level synthesis, High-performance computing, Parallel programming model, Source-to-source compiler",
author = "Alexandros Papakonstantinou and Karthik Gururaj and Stratton, {John A.} and Deming Chen and Jason Cong and Hwu, {Wen-Mei W}",
year = "2013",
month = "10",
day = "21",
doi = "10.1145/2514641.2514652",
language = "English (US)",
volume = "13",
journal = "Transactions on Embedded Computing Systems",
issn = "1539-9087",
publisher = "Association for Computing Machinery (ACM)",
number = "2",

}

TY - JOUR

T1 - Efficient compilation of CUDA kernels for high-performance computing on FPGAs

AU - Papakonstantinou, Alexandros

AU - Gururaj, Karthik

AU - Stratton, John A.

AU - Chen, Deming

AU - Cong, Jason

AU - Hwu, Wen-Mei W

PY - 2013/10/21

Y1 - 2013/10/21

AB - The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.

KW - FPGA

KW - Heterogeneous compute systems

KW - High-level synthesis

KW - High-performance computing

KW - Parallel programming model

KW - Source-to-source compiler

UR - http://www.scopus.com/inward/record.url?scp=84885596347&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84885596347&partnerID=8YFLogxK

U2 - 10.1145/2514641.2514652

DO - 10.1145/2514641.2514652

M3 - Article

AN - SCOPUS:84885596347

VL - 13

JO - Transactions on Embedded Computing Systems

JF - Transactions on Embedded Computing Systems

SN - 1539-9087

IS - 2

M1 - 25

ER -