Exploiting Intel Advanced Matrix Extensions (AMX) for Large Language Model Inference

Hyungyo Kim, Gaohan Ye, Nachuan Wang, Amir Yazdanbakhsh, Nam Sung Kim

Research output: Contribution to journal › Article › peer-review

Abstract

The ever-increasing number of parameters in Large Language Models (LLMs) demands many expensive GPUs for both inference and training. This is because even a high-end GPU such as the NVIDIA A100 can store only a subset of the parameters due to its limited memory capacity. To reduce the number of required GPUs, especially for inference, we may exploit the large memory capacity of the (host) CPU to store not only all the model parameters but also the intermediate outputs, which likewise require substantial memory capacity. However, this necessitates frequent data transfers between the CPU and GPU over the slow PCIe interface, creating a bottleneck that prevents inference from achieving both low latency and high throughput. To address this challenge, we first propose CPU-GPU cooperative computing that exploits the Advanced Matrix Extensions (AMX) capability of the latest Intel CPU, codenamed Sapphire Rapids (SPR). Second, we propose an adaptive model partitioning policy that determines which layers of a given LLM run on the CPU and which on the GPU, based on their memory capacity requirements and arithmetic intensity. Because the CPU executes the layers with large memory requirements but low arithmetic intensity, the amount of data transferred over the PCIe interface is significantly reduced, improving LLM inference performance. Our evaluation demonstrates that CPU-GPU cooperative computing, based on this policy, delivers 12.1× lower latency and 5.4× higher throughput than GPU-only computing for OPT-30B inference when both approaches store the model in CPU memory.
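
The abstract describes the partitioning policy only at a high level. As a minimal illustration, the Python sketch below shows one plausible form of an arithmetic-intensity-based placement: layers whose weights fit in GPU memory and whose arithmetic intensity is high run on the GPU, while memory-hungry, memory-bound layers stay on the CPU, where AMX can execute them without shipping their weights over PCIe. The Layer fields, the intensity threshold, and the greedy assignment order are illustrative assumptions, not the authors' implementation.

# Hypothetical sketch of an arithmetic-intensity-based partitioning policy.
# All names and thresholds are illustrative assumptions, not the paper's code.
from dataclasses import dataclass

@dataclass
class Layer:
    name: str
    weight_bytes: int  # memory needed for this layer's parameters
    flops: int         # floating-point operations per forward pass
    bytes_moved: int   # bytes of memory traffic per forward pass

def arithmetic_intensity(layer: Layer) -> float:
    """FLOPs per byte of memory traffic; low values mean memory-bound."""
    return layer.flops / layer.bytes_moved

def partition(layers, gpu_mem_budget, intensity_threshold):
    """Map each layer name to 'gpu' or 'cpu'.

    Compute-bound layers go to the GPU while its memory budget lasts;
    everything else runs on the CPU with AMX, so its weights never
    cross the PCIe interface.
    """
    placement, gpu_mem_used = {}, 0
    # Offer the most compute-bound layers to the GPU first.
    for layer in sorted(layers, key=arithmetic_intensity, reverse=True):
        fits = gpu_mem_used + layer.weight_bytes <= gpu_mem_budget
        if fits and arithmetic_intensity(layer) >= intensity_threshold:
            placement[layer.name] = "gpu"
            gpu_mem_used += layer.weight_bytes
        else:
            placement[layer.name] = "cpu"
    return placement

# Example call (hypothetical numbers): a 40 GB GPU budget and a
# threshold of 50 FLOPs/byte:
#   partition(layers, gpu_mem_budget=40 * 2**30, intensity_threshold=50.0)

Under such a policy, large but memory-bound layers never cross PCIe at all, which is the mechanism the abstract credits for the reduced transfer volume.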

Original language: English (US)
Pages (from-to): 117-120
Number of pages: 4
Journal: IEEE Computer Architecture Letters
Volume: 23
Issue number: 1
State: Published - Jan 1 2024

Keywords

  • advanced matrix extensions
  • cooperative heterogeneous computing
  • Large language models

ASJC Scopus subject areas

  • Hardware and Architecture
