Exploring HW/SW Co-Design for Video Analysis on CPU-FPGA Heterogeneous Systems

Xiaofan Zhang, Yuan Ma, Jinjun Xiong, Wen-mei Hwu, Volodymyr Kindratenko, Deming Chen

Research output: Contribution to journal › Article › peer-review

Abstract

Deep neural network (DNN)-based video analysis has become one of the most essential and challenging tasks for capturing implicit information from video streams. Although DNNs significantly improve analysis quality, they introduce intensive compute and memory demands and require dedicated hardware for efficient processing. Customized heterogeneous systems, which combine general-purpose processors (CPUs) with specialized processors (DNN accelerators), are a promising solution. Among such systems, the combination of CPU and FPGA has been intensively studied for DNN inference, offering improved latency and energy consumption compared to CPU + GPU schemes and increased flexibility and reduced time-to-market cost compared to CPU + ASIC designs. However, deploying DNN-based video analysis on CPU + FPGA systems still presents challenges: tedious RTL programming, intricate design verification, and time-consuming design space exploration. To address these challenges, we present a novel framework, called EcoSys, to explore co-design and optimization opportunities on CPU-FPGA heterogeneous systems for accelerating video analysis. Novel technologies include 1) a coherent memory space shared by the host and the customized accelerator, enabling efficient task partitioning and online DNN model refinement with reduced data transfer latency; 2) an end-to-end design flow that supports high-level design abstraction and allows rapid development of customized hardware accelerators from Python-based DNN descriptions; 3) a design space exploration (DSE) engine that determines the design space and searches for optimized solutions under the targeted heterogeneous system and user-specified constraints; and 4) a complete set of co-optimization solutions, including a layer-based pipeline, a feature map partition scheme, and an efficient memory hierarchy design for the accelerator, plus multithreaded programming for the CPU. In this article, we demonstrate our design framework by accelerating the long-term recurrent convolutional network (LRCN), which analyzes the input video and outputs one semantic caption per frame. EcoSys delivers 314.7 and 58.1 frames/s when targeting the LRCN model with AlexNet and VGG-16 backbones, respectively. Compared to a multithreaded CPU implementation and a pure FPGA design, EcoSys achieves 20.6× and 5.3× higher throughput, respectively.
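To make the DSE engine's role concrete, the sketch below enumerates a toy design space (pipeline depth, feature-map partition factor, and processing elements per stage) and keeps the highest-throughput point that fits a resource budget. Every name, cost formula, and budget here is a hypothetical assumption for illustration only; EcoSys's actual engine models the targeted CPU-FPGA system and user constraints in far more detail.

```python
from itertools import product

# Hypothetical FPGA budget (DSP slices, on-chip buffer in KB); these
# numbers are illustrative, not the device evaluated in the paper.
DSP_BUDGET = 2048
BRAM_BUDGET_KB = 4608

def resource_use(depth, parts, pes):
    """Toy resource model: pipeline stages and PEs cost DSPs; finer
    feature-map partitions need more on-chip buffering per stage."""
    dsps = depth * pes * 5
    bram_kb = depth * parts * 64
    return dsps, bram_kb

def throughput_fps(depth, parts, pes):
    """Toy performance model: parallelism across stages, PEs, and
    partitions, discounted by boundary overhead from partitioning."""
    parallelism = depth * pes * parts
    overhead = 1.0 + 0.10 * (parts - 1)
    return 5.0 * parallelism / overhead

best = None
for depth, parts, pes in product(range(1, 9), range(1, 9), range(1, 17)):
    dsps, bram_kb = resource_use(depth, parts, pes)
    if dsps > DSP_BUDGET or bram_kb > BRAM_BUDGET_KB:
        continue  # prune infeasible design points
    fps = throughput_fps(depth, parts, pes)
    if best is None or fps > best[0]:
        best = (fps, depth, parts, pes)

fps, depth, parts, pes = best
print(f"best point: {fps:.1f} fps with {depth} stages, "
      f"{parts} partitions, {pes} PEs/stage")
```

An exhaustive sweep like this works only because the toy space has about a thousand points; a real engine would prune with analytical models before evaluating candidates.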

Original language: English (US)
Pages (from-to): 1606-1619
Number of pages: 14
Journal: IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems
Volume: 41
Issue number: 6
DOIs
State: Published - Jun 1 2022

Keywords

  • Acceleration
  • Central processing unit
  • Coherent accelerator processor interface (CAPI)
  • Deep neural network (DNN)
  • Feature extraction
  • Field programmable gate arrays
  • Graphics processing units
  • Hardware
  • Heterogeneous system
  • HW/SW co-design
  • Memory management
  • Task analysis

ASJC Scopus subject areas

  • Software
  • Electrical and Electronic Engineering
  • Computer Graphics and Computer-Aided Design
