TY - JOUR
T1 - Unleashing the Potential of PIM
T2 - Accelerating Large Batched Inference of Transformer-Based Generative Models
AU - Choi, Jaewan
AU - Park, Jaehyun
AU - Kyung, Kwanhee
AU - Kim, Nam Sung
AU - Ahn, Jung Ho
N1 - Publisher Copyright:
© 2002-2011 IEEE.
PY - 2023/7/1
Y1 - 2023/7/1
AB - Transformer-based generative models, such as GPT, summarize an input sequence by generating key/value (KV) matrices through attention and generate the corresponding output sequence by utilizing these matrices once per token of the sequence. Both input and output sequences tend to get longer, which improves context understanding and conversation quality. These models are also typically batched for inference to improve serving throughput. All these trends enable the models' weights to be reused effectively, increasing the relative importance of sequence generation, especially of processing the KV matrices through attention. We identify that conventional computing platforms (e.g., GPUs) are not efficient at handling this attention part of inference because each request generates different KV matrices, attention has a low operations-per-byte ratio regardless of the batch size, and the aggregate size of the KV matrices can even surpass that of the entire model weights. This motivates us to propose AttAcc, which exploits the fact that the KV matrices are written once during summarization but read many times (proportional to the output sequence length), each time multiplied by the embedding vector corresponding to an output token. The volume of data entering and leaving AttAcc can be orders of magnitude smaller than the volume that must be read internally for attention. We design AttAcc with multiple processing-in-memory devices, each multiplying the embedding vector with the portion of the KV matrices residing in that device, saving external (inter-device) bandwidth and energy consumption.
KW - Transformer-based generative model
KW - attention
KW - processing-in-memory
UR - http://www.scopus.com/inward/record.url?scp=85168277477&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85168277477&partnerID=8YFLogxK
U2 - 10.1109/LCA.2023.3305386
DO - 10.1109/LCA.2023.3305386
M3 - Article
AN - SCOPUS:85168277477
SN - 1556-6056
VL - 22
SP - 113
EP - 116
JO - IEEE Computer Architecture Letters
JF - IEEE Computer Architecture Letters
IS - 2
ER -