TY - GEN
T1 - An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models
AU - Park, Sang Soo
AU - Kim, Kyungsoo
AU - So, Jinin
AU - Jung, Jin
AU - Lee, Jonggeon
AU - Woo, Kyoungwan
AU - Kim, Nayeon
AU - Lee, Younghyun
AU - Kim, Hyungyo
AU - Kwon, Yongsuk
AU - Kim, Jinhyun
AU - Lee, Jieun
AU - Cho, Yeongon
AU - Tai, Yongmin
AU - Cho, Jeonghyeon
AU - Song, Hoyoung
AU - Ahn, Jung Ho
AU - Kim, Nam Sung
N1 - Publisher Copyright:
© 2024 IEEE.
PY - 2024
Y1 - 2024
N2 - Transformer-based large language models (LLMs) such as Generative Pre-Trained Transformer (GPT) have become popular due to their remarkable performance across diverse applications, including text generation and translation. For LLM training and inference, the GPU has been the predominant accelerator with its pervasive software development ecosystem and powerful computing capability. However, as the size of LLMs keeps increasing for higher performance and/or more complex applications, a single GPU cannot efficiently accelerate LLM training and inference due to its limited memory capacity, which demands frequent transfers of the model parameters needed by the GPU to compute the current layer(s) from the host CPU memory/storage. A GPU appliance may provide enough aggregated memory capacity with multiple GPUs, but it suffers from frequent transfers of intermediate values among GPU devices, each accelerating specific layers of a given LLM. As the frequent transfers of these model parameters and intermediate values are performed over relatively slow device-to-device interconnects such as PCIe or NVLink, they become the key bottleneck for efficient acceleration of LLMs. Focusing on accelerating LLM inference, which is essential for many commercial services, we develop CXL-PNM, a processing near memory (PNM) platform based on the emerging interconnect technology, Compute eXpress Link (CXL). Specifically, we first devise an LPDDR5X-based CXL memory architecture with 512GB of capacity and 1.1TB/s of bandwidth, which boasts 16× larger capacity and 10× higher bandwidth than GDDR6- and DDR5-based CXL memory architectures, respectively, under a module form-factor constraint. Second, we design a CXL-PNM controller architecture integrated with an LLM inference accelerator, exploiting the unique capabilities of such CXL memory to overcome the disadvantages of competing technologies such as HBM-PIM and AxDIMM. Lastly, we implement a CXL-PNM software stack that supports seamless and transparent use of CXL-PNM for Python-based LLM programs. Our evaluation shows that a CXL-PNM appliance with 8 CXL-PNM devices offers 23% lower latency, 31% higher throughput, and 2.8× higher energy efficiency at 30% lower hardware cost than a GPU appliance with 8 GPU devices for an LLM inference service.
AB - Transformer-based large language models (LLMs) such as Generative Pre-Trained Transformer (GPT) have become popular due to their remarkable performance across diverse applications, including text generation and translation. For LLM training and inference, the GPU has been the predominant accelerator with its pervasive software development ecosystem and powerful computing capability. However, as the size of LLMs keeps increasing for higher performance and/or more complex applications, a single GPU cannot efficiently accelerate LLM training and inference due to its limited memory capacity, which demands frequent transfers of the model parameters needed by the GPU to compute the current layer(s) from the host CPU memory/storage. A GPU appliance may provide enough aggregated memory capacity with multiple GPUs, but it suffers from frequent transfers of intermediate values among GPU devices, each accelerating specific layers of a given LLM. As the frequent transfers of these model parameters and intermediate values are performed over relatively slow device-to-device interconnects such as PCIe or NVLink, they become the key bottleneck for efficient acceleration of LLMs. Focusing on accelerating LLM inference, which is essential for many commercial services, we develop CXL-PNM, a processing near memory (PNM) platform based on the emerging interconnect technology, Compute eXpress Link (CXL). Specifically, we first devise an LPDDR5X-based CXL memory architecture with 512GB of capacity and 1.1TB/s of bandwidth, which boasts 16× larger capacity and 10× higher bandwidth than GDDR6- and DDR5-based CXL memory architectures, respectively, under a module form-factor constraint. Second, we design a CXL-PNM controller architecture integrated with an LLM inference accelerator, exploiting the unique capabilities of such CXL memory to overcome the disadvantages of competing technologies such as HBM-PIM and AxDIMM. Lastly, we implement a CXL-PNM software stack that supports seamless and transparent use of CXL-PNM for Python-based LLM programs. Our evaluation shows that a CXL-PNM appliance with 8 CXL-PNM devices offers 23% lower latency, 31% higher throughput, and 2.8× higher energy efficiency at 30% lower hardware cost than a GPU appliance with 8 GPU devices for an LLM inference service.
KW - CXL
KW - CXL-PNM
KW - LLM
KW - LPDDR
UR - http://www.scopus.com/inward/record.url?scp=85190237695&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85190237695&partnerID=8YFLogxK
U2 - 10.1109/HPCA57654.2024.00078
DO - 10.1109/HPCA57654.2024.00078
M3 - Conference contribution
AN - SCOPUS:85190237695
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 970
EP - 982
BT - Proceedings - 2024 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
PB - IEEE Computer Society
T2 - 30th IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
Y2 - 2 March 2024 through 6 March 2024
ER -