TY - JOUR
T1 - Accelerating Sampling and Aggregation Operations in GNN Frameworks with GPU Initiated Direct Storage Accesses
AU - Park, Jeongmin Brian
AU - Mailthody, Vikram Sharma
AU - Qureshi, Zaid
AU - Hwu, Wen-mei
N1 - We would like to acknowledge all of the help from members of the IMPACT research group, the IBM-Illinois Center for Cognitive Computing Systems Research (C3SR), and NVIDIA Research, without which we could not have achieved the results reported in this paper. Special thanks to Kun Wu, Isaac Gelado, and Scott Mahlke, who generously shared their insights through numerous discussions. This work uses GPUs donated by NVIDIA and is partially supported by the IBM-Illinois C3SR and by the IBM-Illinois Discovery Accelerator Institute (IIDA).
PY - 2024
Y1 - 2024
AB - Graph Neural Networks (GNNs) are emerging as a powerful tool for learning from graph-structured data and performing sophisticated inference tasks in various application domains. Although GNNs have been shown to be effective on modest-sized graphs, training them on large-scale graphs remains a significant challenge due to the lack of efficient storage access and caching methods for graph data. Existing frameworks for training GNNs use CPUs for graph sampling and feature aggregation, while the training and updating of model weights are executed on GPUs. However, our in-depth profiling shows that CPUs cannot achieve the graph sampling and feature aggregation throughput required to keep up with GPUs. Furthermore, when the graph and its embeddings do not fit in CPU memory, the overhead introduced by the operating system, e.g., for handling page faults, causes gross under-utilization of hardware and prolonged end-to-end execution time. To address these issues, we propose the GPU Initiated Direct Storage Access (GIDS) dataloader to enable GPU-oriented GNN training for large-scale graphs while efficiently utilizing all hardware resources, such as CPU memory, storage, and GPU memory. The GIDS dataloader first addresses memory capacity constraints by enabling GPU threads to directly fetch feature vectors from storage. Then, we introduce a set of innovative solutions, including a dynamic storage access accumulator, a constant CPU buffer, and a GPU software cache with window buffering, to balance resource utilization across the entire system for improved end-to-end training throughput. Our evaluation on terabyte-scale GNN datasets using a single GPU shows that the GIDS dataloader accelerates the overall DGL GNN training pipeline by up to 582× compared to the current state-of-the-art DGL dataloader.
UR - https://www.scopus.com/pages/publications/85190662550
U2 - 10.14778/3648160.3648166
DO - 10.14778/3648160.3648166
M3 - Conference article
AN - SCOPUS:85190662550
SN - 2150-8097
VL - 17
SP - 1227
EP - 1240
JO - Proceedings of the VLDB Endowment
JF - Proceedings of the VLDB Endowment
IS - 6
T2 - 50th International Conference on Very Large Data Bases, VLDB 2024
Y2 - 24 August 2024 through 29 August 2024
ER -