FCUDA-HB

Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow

Ying Chen, Tan Nguyen, Yao Chen, Swathi T. Gurumani, Yun Liang, Kyle Rupnow, Jason Cong, Wen-Mei W Hwu, Deming Chen

Research output: Contribution to journal (Article)

Abstract

Recent progress in high-level synthesis (HLS) has helped raise the abstraction level of hardware design. HLS flows reduce designer effort by allowing development in a high-level language, which improves debugging, code reuse, and the ability to explore different implementation options. However, although the HLS process itself is fast, implementation and performance analysis still require lengthy logic synthesis and physical design. To optimize a design, HLS tools must explore the design space to extract parallelism at multiple levels of granularity, including parallelism within a single HLS-generated core and parallelism across multiple core instances. Core interconnect and external bandwidth limitations can significantly restrict the feasible options in the design space. With many dimensions to explore, it quickly becomes infeasible to perform full logic synthesis and physical design for every design point. At the same time, generating and evaluating the communication infrastructure as part of the exploration is critical for determining system performance. In this paper, we therefore extend the prior multilevel granularity parallelism exploration in the FCUDA HLS flow, which takes CUDA code as design input and generates a corresponding field-programmable gate array (FPGA) implementation. Our framework performs an initial characterization of the application design space, then analytically explores the design space considering parallelism, core interconnect, and external memory bandwidth, and selects a Pareto-optimal set of designs. The flow is fully automated: it characterizes the analytical model, performs the exploration, selects a solution, and integrates multiple instantiations of FCUDA cores via an Advanced eXtensible Interface (AXI) bus interconnect. Our results demonstrate that this new FCUDA flow efficiently identifies and generates implementations with up to × improved system performance compared to single-level granularity parallelism (core-level optimization).
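
The selection step mentioned in the abstract can be illustrated with a minimal sketch. The Python fragment below filters a handful of made-up design points down to their Pareto-optimal subset. The `DesignPoint` fields, the choice of two objectives (estimated latency and estimated area), and all numbers are assumptions for illustration only; they are not taken from the paper or from the FCUDA tool.

```python
from typing import List, NamedTuple


class DesignPoint(NamedTuple):
    """Hypothetical design point; field names are illustrative assumptions."""
    cores: int       # number of core instances (inter-core parallelism)
    unroll: int      # unroll factor inside one core (intra-core parallelism)
    latency: float   # estimated execution time, lower is better
    area: float      # estimated resource usage, lower is better


def dominates(a: DesignPoint, b: DesignPoint) -> bool:
    """True if a is no worse than b in both objectives and better in one."""
    no_worse = a.latency <= b.latency and a.area <= b.area
    strictly_better = a.latency < b.latency or a.area < b.area
    return no_worse and strictly_better


def pareto_front(points: List[DesignPoint]) -> List[DesignPoint]:
    """Keep every point that no other point dominates (input order preserved)."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Made-up estimates standing in for the output of an analytical model.
POINTS = [
    DesignPoint(cores=1, unroll=1, latency=100.0, area=10.0),
    DesignPoint(cores=2, unroll=1, latency=60.0, area=20.0),
    DesignPoint(cores=2, unroll=2, latency=70.0, area=25.0),  # slower and larger than (2, 1)
    DesignPoint(cores=4, unroll=1, latency=40.0, area=40.0),
    DesignPoint(cores=4, unroll=2, latency=40.0, area=45.0),  # same latency, larger than (4, 1)
]

FRONT = pareto_front(POINTS)
```

The quadratic all-pairs comparison is adequate for a small candidate set. Because the estimates come from a model rather than from synthesis, such a filter can run over many candidates before any design is actually implemented.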

Original language: English (US)
Article number: 7450674
Pages (from-to): 2032-2045
Number of pages: 14
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Volume: 35
Issue number: 12
DOI: 10.1109/TCAD.2016.2552821
State: Published - Dec 1 2016

Fingerprint

Field programmable gate arrays (FPGA)
Bandwidth
High level languages
Core levels
High level synthesis
Analytical models
Hardware
Data storage equipment
Communication

Keywords

  • Bus-generation
  • communication bus
  • high-level synthesis (HLS)
  • system generation

ASJC Scopus subject areas

  • Software
  • Computer Graphics and Computer-Aided Design
  • Electrical and Electronic Engineering

Cite this

FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow. / Chen, Ying; Nguyen, Tan; Chen, Yao; Gurumani, Swathi T.; Liang, Yun; Rupnow, Kyle; Cong, Jason; Hwu, Wen-Mei W; Chen, Deming.

In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, Vol. 35, No. 12, 7450674, 01.12.2016, p. 2032-2045.

Chen, Ying; Nguyen, Tan; Chen, Yao; Gurumani, Swathi T.; Liang, Yun; Rupnow, Kyle; Cong, Jason; Hwu, Wen-Mei W; Chen, Deming. / FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow. In: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems. 2016; Vol. 35, No. 12. pp. 2032-2045.

@article{ed446eccd4f74bf081d67c2f481d7af1,
title = "FCUDA-HB: Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow",
abstract = "Recent progress in high-level synthesis (HLS) has helped raise the abstraction level of hardware design. HLS flows reduce designer effort by allowing development in a high-level language, which improves debugging, code reuse and ability to explore different implementation options. However, although the HLS process is fast, implementation and performance analysis still require lengthy logic synthesis and physical design. For design optimization, HLS tools require design space exploration to obtain parallelism at multiple levels of granularity including parallelism within a single HLS-generated core and parallelism between multiple instances of cores. Core interconnect and external bandwidth limitations can significantly impact feasible options in the design space. With many dimensions in a design space exploration, it quickly becomes infeasible to perform full logic synthesis and physical design for each possible design point. However, generation and evaluation of communications infrastructure as part of the exploration is critical to determine the system performance. Thus, in this paper, we extend the prior multilevel granularity parallelism exploration in the FCUDA HLS flow, which takes CUDA code as design input and generates a corresponding field programmable gate array implementation. Our framework performs an initial characterization of the application design space, then analytically explores the design space considering parallelism, core interconnect, and external memory bandwidth, and selects a pareto-optimal set of designs. Our flow is completely automated to perform the exploration to characterize the analytical model, perform the exploration, select a solution, and integrate multiple instantiations of FCUDA cores via an advanced extensible interface bus interconnect. 
Our results demonstrate that this new FCUDA flow efficiently identifies and generates implementations with up to × improved system performance compared to single-level granularity parallelism (core-level optimization).",
keywords = "Bus-generation, communication bus, high-level synthesis (HLS), system generation",
author = "Ying Chen and Tan Nguyen and Yao Chen and Gurumani, {Swathi T.} and Yun Liang and Kyle Rupnow and Jason Cong and Hwu, {Wen-Mei W} and Deming Chen",
year = "2016",
month = "12",
day = "1",
doi = "10.1109/TCAD.2016.2552821",
language = "English (US)",
volume = "35",
pages = "2032--2045",
journal = "IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems",
issn = "0278-0070",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
number = "12",

}

TY - JOUR

T1 - FCUDA-HB

T2 - Hierarchical and Scalable Bus Architecture Generation on FPGAs With the FCUDA Flow

AU - Chen, Ying

AU - Nguyen, Tan

AU - Chen, Yao

AU - Gurumani, Swathi T.

AU - Liang, Yun

AU - Rupnow, Kyle

AU - Cong, Jason

AU - Hwu, Wen-Mei W

AU - Chen, Deming

PY - 2016/12/1

Y1 - 2016/12/1

AB - Recent progress in high-level synthesis (HLS) has helped raise the abstraction level of hardware design. HLS flows reduce designer effort by allowing development in a high-level language, which improves debugging, code reuse and ability to explore different implementation options. However, although the HLS process is fast, implementation and performance analysis still require lengthy logic synthesis and physical design. For design optimization, HLS tools require design space exploration to obtain parallelism at multiple levels of granularity including parallelism within a single HLS-generated core and parallelism between multiple instances of cores. Core interconnect and external bandwidth limitations can significantly impact feasible options in the design space. With many dimensions in a design space exploration, it quickly becomes infeasible to perform full logic synthesis and physical design for each possible design point. However, generation and evaluation of communications infrastructure as part of the exploration is critical to determine the system performance. Thus, in this paper, we extend the prior multilevel granularity parallelism exploration in the FCUDA HLS flow, which takes CUDA code as design input and generates a corresponding field programmable gate array implementation. Our framework performs an initial characterization of the application design space, then analytically explores the design space considering parallelism, core interconnect, and external memory bandwidth, and selects a pareto-optimal set of designs. Our flow is completely automated to perform the exploration to characterize the analytical model, perform the exploration, select a solution, and integrate multiple instantiations of FCUDA cores via an advanced extensible interface bus interconnect. Our results demonstrate that this new FCUDA flow efficiently identifies and generates implementations with up to × improved system performance compared to single-level granularity parallelism (core-level optimization).

KW - Bus-generation

KW - communication bus

KW - high-level synthesis (HLS)

KW - system generation

UR - http://www.scopus.com/inward/record.url?scp=84999188190&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84999188190&partnerID=8YFLogxK

U2 - 10.1109/TCAD.2016.2552821

DO - 10.1109/TCAD.2016.2552821

M3 - Article

VL - 35

SP - 2032

EP - 2045

JO - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

JF - IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

SN - 0278-0070

IS - 12

M1 - 7450674

ER -