V10: Hardware-Assisted NPU Multi-tenancy for Improved Resource Utilization and Fairness

Yuqi Xue, Lifeng Nai, Yiqi Liu, Jian Huang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution


Modern cloud platforms have deployed neural processing units (NPUs) like Google Cloud TPUs to accelerate online machine learning (ML) inference services. To improve the resource utilization of NPUs, these platforms allow multiple ML applications to share the same NPU and have developed both time-multiplexed and preemption-based sharing mechanisms. However, our study with real-world NPUs reveals that these approaches suffer from surprisingly low utilization, due to the lack of support for fine-grained hardware resource sharing in the NPU. Specifically, the NPU's separate systolic array and vector unit cannot be fully utilized at the same time, which calls for fundamental hardware assistance for supporting multi-tenancy. In this paper, we present V10, a hardware-assisted NPU multi-tenancy framework that improves resource utilization while ensuring fairness across ML services. We rethink the NPU architecture to support multi-tenancy. V10 employs an operator scheduler that enables concurrent operator executions on the systolic array and the vector unit, and offers flexibility for enforcing different priority-based resource-sharing policies. V10 also enables fine-grained operator preemption and lightweight context switching in the NPU. To further improve NPU utilization, V10 develops a clustering-based workload collocation mechanism for identifying the best-matching ML services to share an NPU. We implement V10 with an NPU simulator. Our experiments with various ML workloads from the MLPerf AI Benchmarks demonstrate that V10 improves overall NPU utilization by 1.64×, increases aggregated throughput by 1.57×, and reduces the average latency of ML services by 1.56× and tail latency by 1.74× on average, in comparison with state-of-the-art NPU multi-tenancy approaches.
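The core idea of the operator scheduler — keeping the systolic array and the vector unit busy with operators from different tenants at the same time, under a priority policy — can be illustrated with a toy simulation. This is a hypothetical sketch, not the paper's implementation: operator names, durations, and the strict-priority policy are all illustrative assumptions.

```python
import heapq
from dataclasses import dataclass
from itertools import count

# Illustrative sketch (NOT the V10 implementation): a priority-based
# operator scheduler with one ready queue per NPU execution unit, so a
# matmul on the systolic array and an activation on the vector unit,
# possibly from different tenants, can run concurrently.

@dataclass
class Operator:
    tenant: str
    name: str
    unit: str       # "systolic" (e.g. matmul) or "vector" (e.g. activation)
    duration: int   # hypothetical cycle count

def schedule(ops, priority):
    """Greedy per-unit simulation: each unit independently runs its ready
    operators in priority order (lower value = higher priority).
    Returns a per-unit trace of (start_cycle, tenant, operator_name)."""
    queues = {"systolic": [], "vector": []}
    tie = count()  # tie-breaker so equal-priority ops keep arrival order
    for op in ops:
        heapq.heappush(queues[op.unit], (priority[op.tenant], next(tie), op))
    trace = {"systolic": [], "vector": []}
    for unit in queues:
        t = 0
        while queues[unit]:
            _, _, op = heapq.heappop(queues[unit])
            trace[unit].append((t, op.tenant, op.name))
            t += op.duration
    return trace

# Two tenants: A's matmul and A's softmax start at cycle 0 on different
# units, while B's operators fill the remaining slots on each unit.
ops = [
    Operator("A", "matmul0", "systolic", 4),
    Operator("B", "relu0", "vector", 2),
    Operator("B", "matmul1", "systolic", 3),
    Operator("A", "softmax0", "vector", 1),
]
trace = schedule(ops, priority={"A": 0, "B": 1})
# trace["systolic"] → [(0, 'A', 'matmul0'), (4, 'B', 'matmul1')]
# trace["vector"]   → [(0, 'A', 'softmax0'), (1, 'B', 'relu0')]
```

In this toy model the two units overlap perfectly; the paper's contribution is making such overlap possible on real NPU hardware, where time-multiplexed sharing would serialize the tenants and leave one unit idle.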

Original language: English (US)
Title of host publication: ISCA 2023 - Proceedings of the 2023 50th Annual International Symposium on Computer Architecture
Publisher: Institute of Electrical and Electronics Engineers Inc.
Number of pages: 15
ISBN (Electronic): 9798400700958
State: Published - Jun 17 2023
Event: 50th Annual International Symposium on Computer Architecture, ISCA 2023 - Orlando, United States
Duration: Jun 17 2023 - Jun 21 2023

Publication series

Name: Proceedings - International Symposium on Computer Architecture
ISSN (Print): 1063-6897


Conference: 50th Annual International Symposium on Computer Architecture, ISCA 2023
Country/Territory: United States


Keywords

  • ML Accelerator
  • Multi-tenancy
  • Neural Processing Unit

ASJC Scopus subject areas

  • Hardware and Architecture


