TY - GEN
T1 - Acceleration of Graph Neural Networks with Heterogenous Accelerators Architecture
AU - Cao, Kaiwen
AU - Gajjar, Archit
AU - Gerstman, Liad
AU - Wu, Kun
AU - Chalamalasetti, Sai Rahul
AU - Dhakal, Aditya
AU - Pedretti, Giacomo
AU - Prakash, Pavana
AU - Hwu, Wen-Mei
AU - Chen, Deming
AU - Milojicic, Dejan
N1 - This work is supported by Hewlett Packard Labs, AMD Center of Excellence at UIUC, and the AMD Heterogeneous Adaptive Compute Cluster (HACC) initiative.
PY - 2024
Y1 - 2024
N2 - Graph Neural Networks (GNNs) have been used to solve complex problems in drug discovery, social media analysis, etc. Meanwhile, GPUs are becoming the dominant accelerators for improving deep neural network performance. However, due to the characteristics of graph data, it is challenging to accelerate GNN-type workloads with GPUs alone. GraphSAGE is one representative GNN workload that uses sampling to improve GNN learning efficiency. Profiling GraphSAGE using the PyG library reveals that the sampling stage on the CPU is the bottleneck. Hence, we propose a heterogeneous system architecture solution in which the sampling algorithm is accelerated on customizable accelerators (FPGAs) and sampled data is fed into GPU training through a PCIe Peer-to-Peer (P2P) communication flow. With FPGA acceleration, for the sampling stage alone, we achieve a speed-up of 2.38× to 8.55× compared with sampling on the CPU. For end-to-end latency, compared with the traditional flow, we achieve a speed-up of 1.24× to 1.99×.
AB - Graph Neural Networks (GNNs) have been used to solve complex problems in drug discovery, social media analysis, etc. Meanwhile, GPUs are becoming the dominant accelerators for improving deep neural network performance. However, due to the characteristics of graph data, it is challenging to accelerate GNN-type workloads with GPUs alone. GraphSAGE is one representative GNN workload that uses sampling to improve GNN learning efficiency. Profiling GraphSAGE using the PyG library reveals that the sampling stage on the CPU is the bottleneck. Hence, we propose a heterogeneous system architecture solution in which the sampling algorithm is accelerated on customizable accelerators (FPGAs) and sampled data is fed into GPU training through a PCIe Peer-to-Peer (P2P) communication flow. With FPGA acceleration, for the sampling stage alone, we achieve a speed-up of 2.38× to 8.55× compared with sampling on the CPU. For end-to-end latency, compared with the traditional flow, we achieve a speed-up of 1.24× to 1.99×.
KW - Field Programmable Gate Array (FPGA)
KW - Graph Neural Network (GNN)
KW - Graphics Processing Unit (GPU)
KW - High-Level Synthesis (HLS)
KW - Peer-to-Peer (P2P)
UR - http://www.scopus.com/inward/record.url?scp=85217170233&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85217170233&partnerID=8YFLogxK
U2 - 10.1109/SCW63240.2024.00148
DO - 10.1109/SCW63240.2024.00148
M3 - Conference contribution
AN - SCOPUS:85217170233
T3 - Proceedings of SC 2024-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis
SP - 1081
EP - 1089
BT - Proceedings of SC 2024-W
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 2024 Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC Workshops 2024
Y2 - 17 November 2024 through 22 November 2024
ER -