TY - GEN
T1 - AttAcc! Unleashing the Power of PIM for Batched Transformer-based Generative Model Inference
AU - Park, Jaehyun
AU - Choi, Jaewan
AU - Kyung, Kwanhee
AU - Kim, Michael Jaemin
AU - Kwon, Yongsuk
AU - Kim, Nam Sung
AU - Ahn, Jung Ho
N1 - Publisher Copyright:
© 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/4/27
Y1 - 2024/4/27
AB - The Transformer-based generative model (TbGM), comprising summarization (Sum) and generation (Gen) stages, has demonstrated unprecedented generative performance across a wide range of applications. However, it also demands immense compute and memory resources. In particular, the Gen stages, which consist of the attention and fully-connected (FC) layers, dominate the overall execution time. We further reveal that conventional GPU-based systems used for TbGM inference cannot execute the attention layer efficiently, even with batching, because each request's attention operates on its own key-value context and thus gains no data reuse from batching. To address this inefficiency, we first propose AttAcc, a processing-in-memory (PIM) architecture for efficient execution of the attention layer. For end-to-end acceleration of TbGM inference, we then propose a novel heterogeneous system architecture and accompanying optimizations that strategically use the xPU and PIM together: the high memory bandwidth of AttAcc serves the attention layer, while the powerful compute capability of the conventional system serves the FC layer. We demonstrate that our GPU-PIM system outperforms a conventional system with the same memory capacity, improving the performance and energy efficiency of running a 175B-parameter TbGM by up to 2.81× and 2.67×, respectively.
KW - DRAM
KW - processing-in-memory
KW - transformer-based generative model
UR - http://www.scopus.com/inward/record.url?scp=85192177035&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85192177035&partnerID=8YFLogxK
DO - 10.1145/3620665.3640422
M3 - Conference contribution
AN - SCOPUS:85192177035
T3 - International Conference on Architectural Support for Programming Languages and Operating Systems - ASPLOS
SP - 103
EP - 119
BT - Proceedings of the 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS 2024), Summer Cycle
PB - Association for Computing Machinery
T2 - 29th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS 2024
Y2 - 27 April 2024 through 1 May 2024
ER -