An LPDDR-based CXL-PNM Platform for TCO-efficient Inference of Transformer-based Large Language Models

Sang Soo Park, Kyungsoo Kim, Jinin So, Jin Jung, Jonggeon Lee, Kyoungwan Woo, Nayeon Kim, Younghyun Lee, Hyungyo Kim, Yongsuk Kwon, Jinhyun Kim, Jieun Lee, Yeongon Cho, Yongmin Tai, Jeonghyeon Cho, Hoyoung Song, Jung Ho Ahn, Nam Sung Kim

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Transformer-based large language models (LLMs) such as Generative Pre-Trained Transformer (GPT) have become popular due to their remarkable performance across diverse applications, including text generation and translation. For LLM training and inference, the GPU has been the predominant accelerator with its pervasive software development ecosystem and powerful computing capability. However, as the size of LLMs keeps increasing for higher performance and/or more complex applications, a single GPU cannot efficiently accelerate LLM training and inference due to its limited memory capacity, which demands frequent transfers of the model parameters needed by the GPU to compute the current layer(s) from the host CPU memory/storage. A GPU appliance may provide enough aggregated memory capacity with multiple GPUs, but it suffers from frequent transfers of intermediate values among GPU devices, each accelerating specific layers of a given LLM. As the frequent transfers of these model parameters and intermediate values are performed over relatively slow device-to-device interconnects such as PCIe or NVLink, they become the key bottleneck for efficient acceleration of LLMs. Focusing on accelerating LLM inference, which is essential for many commercial services, we develop CXL-PNM, a processing near memory (PNM) platform based on the emerging interconnect technology, Compute eXpress Link (CXL). Specifically, we first devise an LPDDR5X-based CXL memory architecture with 512GB of capacity and 1.1TB/s of bandwidth, which boasts 16× larger capacity and 10× higher bandwidth than GDDR6- and DDR5-based CXL memory architectures, respectively, under a module form-factor constraint. Second, we design a CXL-PNM controller architecture integrated with an LLM inference accelerator, exploiting the unique capabilities of such CXL memory to overcome the disadvantages of competing technologies such as HBM-PIM and AxDIMM. Lastly, we implement a CXL-PNM software stack that supports seamless and transparent use of CXL-PNM for Python-based LLM programs. Our evaluation shows that a CXL-PNM appliance with 8 CXL-PNM devices offers 23% lower latency, 31% higher throughput, and 2.8× higher energy efficiency at 30% lower hardware cost than a GPU appliance with 8 GPU devices for an LLM inference service.
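The abstract states that the CXL-PNM software stack lets Python-based LLM programs use the device "seamlessly and transparently," but it does not document the stack's API. The sketch below is only an illustration of what such transparent use could look like from application code; the module name `cxl_pnm` and the device string `"pnm"` are hypothetical assumptions, not the paper's actual interface, and the rest uses standard PyTorch/Hugging Face calls so the snippet still runs on CPU without CXL-PNM hardware.

```python
# Hypothetical usage sketch: `cxl_pnm` and the "pnm" device name are assumptions,
# not the interface described in the paper.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

try:
    import cxl_pnm  # hypothetical runtime that would register a CXL-PNM backend
    device = "pnm"
except ImportError:
    device = "cpu"  # fall back so the sketch runs without CXL-PNM hardware

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").to(device)

# The application code is otherwise unchanged -- the point of a transparent stack
# is that parameters reside in the large-capacity CXL memory and the near-memory
# accelerator executes the inference kernels behind the same Python interface.
inputs = tokenizer("CXL-PNM accelerates", return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=16)
print(tokenizer.decode(outputs[0]))
```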

Original language: English (US)
Title of host publication: Proceedings - 2024 IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
Publisher: IEEE Computer Society
Pages: 970-982
Number of pages: 13
ISBN (Electronic): 9798350393132
DOIs
State: Published - 2024
Event: 30th IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024 - Edinburgh, United Kingdom
Duration: Mar 2 2024 - Mar 6 2024

Publication series

Name: Proceedings - International Symposium on High-Performance Computer Architecture
ISSN (Print): 1530-0897

Conference

Conference: 30th IEEE International Symposium on High-Performance Computer Architecture, HPCA 2024
Country/Territory: United Kingdom
City: Edinburgh
Period: 3/2/24 - 3/6/24

Keywords

  • CXL
  • CXL-PNM
  • LLM
  • LPDDR

ASJC Scopus subject areas

  • Hardware and Architecture
