DeepSpeed-Inference: Enabling Efficient Inference of Transformer Models at Unprecedented Scale

Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, Yuxiong He

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

The landscape of transformer model inference is increasingly diverse in model size, model characteristics, latency and throughput requirements, hardware requirements, etc. With such diversity, designing a versatile inference system is challenging. DeepSpeed-Inference addresses these challenges with (1) a multi-GPU inference solution that minimizes latency while maximizing throughput for both dense and sparse transformers when the model fits in aggregate GPU memory, and (2) a heterogeneous inference solution that leverages CPU/NVMe/GPU memory to enable high-throughput inference for models larger than aggregate GPU memory. DeepSpeed-Inference reduces latency by 6.4× and increases throughput by 1.5× over the state-of-the-art. It enables trillion-parameter-scale inference under real-time latency constraints by leveraging hundreds of GPUs, an unprecedented scale for inference. It can run inference on models 25× larger than GPU-only solutions allow, while delivering a high throughput of 84 TFLOPS (over 50% of A6000 peak).
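The abstract's split between the two solutions hinges on whether the model's weights fit in aggregate GPU memory. A minimal sketch of that decision, with a hypothetical helper (the function name and the back-of-the-envelope sizing are illustrative, not DeepSpeed's actual dispatch logic):

```python
def choose_inference_mode(params_billion: float,
                          bytes_per_param: int,
                          num_gpus: int,
                          gpu_mem_gb: float) -> str:
    """Pick an inference strategy from a rough memory estimate.

    Hypothetical illustration of the criterion described in the abstract:
    if the model fits in aggregate GPU memory, use the multi-GPU solution;
    otherwise fall back to heterogeneous CPU/NVMe/GPU offloading.
    """
    # 1e9 params * bytes/param gives model size in GB (ignoring
    # activations and KV cache, which a real estimate would include).
    model_gb = params_billion * bytes_per_param
    aggregate_gb = num_gpus * gpu_mem_gb
    if model_gb <= aggregate_gb:
        return "multi-gpu"       # dense/sparse model fits on the GPUs
    return "heterogeneous"       # offload weights to CPU/NVMe memory

# Example: a 175B-parameter model in fp16 (2 bytes/param) needs ~350 GB,
# exceeding 8 x 40 GB = 320 GB of aggregate GPU memory:
print(choose_inference_mode(175, 2, 8, 40))  # -> heterogeneous
```

In practice the threshold also depends on activation memory, KV-cache size, and parallelism overheads, but the fit-in-aggregate-memory test is the core distinction the paper draws between its two solutions.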

Original language: English (US)
Title of host publication: Proceedings of SC 2022
Subtitle of host publication: International Conference for High Performance Computing, Networking, Storage and Analysis
Publisher: IEEE Computer Society
ISBN (Electronic): 9781665454445
DOIs
State: Published - 2022
Externally published: Yes
Event: 2022 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2022 - Dallas, United States
Duration: Nov 13 2022 – Nov 18 2022

Publication series

Name: International Conference for High Performance Computing, Networking, Storage and Analysis, SC
Volume: 2022-November
ISSN (Print): 2167-4329
ISSN (Electronic): 2167-4337

Conference

Conference: 2022 International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2022
Country/Territory: United States
City: Dallas
Period: 11/13/22 – 11/18/22

Keywords

  • Deep Learning
  • DeepSpeed
  • Distributed Inference
  • Mixture of Experts
  • PyTorch
  • Transformer models

ASJC Scopus subject areas

  • Software
  • Computer Networks and Communications
  • Computer Science Applications
  • Hardware and Architecture
