A Learned Performance Model for Tensor Processing Units

Sam Kaufman, Phitchaya Phothilimthana, Yanqi Zhou, Charith Mendis, Sudip Roy, Amit Sabne, Mike Burrows

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Accurate hardware performance models are critical to efficient code generation. They can be used by compilers to make heuristic decisions, by superoptimizers as a minimization objective, or by autotuners to find an optimal configuration for a specific program. However, they are difficult to develop because contemporary processors are complex, and the recent proliferation of deep learning accelerators has increased the development burden. We demonstrate a method of learning performance models from a corpus of tensor computation graph programs for Tensor Processing Unit (TPU) instances. We show that our learned model outperforms a heavily optimized analytical performance model on two tasks, tile-size selection and operator fusion, and that it helps an autotuner discover faster programs in a setting where access to TPUs is limited or expensive.
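The abstract describes learning a performance model over tensor computation graph programs. As a rough illustration of that general idea, the sketch below runs one round of GraphSAGE-style message passing over a tiny op graph in NumPy, then a sum readout mapped to a scalar runtime prediction. Everything here is a hypothetical toy: the opcodes, feature choices, dimensions, and untrained random weights are assumptions for illustration, not the paper's actual model or feature set. In practice the parameters would be trained against measured runtimes.

```python
import numpy as np

# Hypothetical featurization: each node (op) gets an opcode one-hot
# plus simple output-shape statistics. All dimensions are illustrative.
NUM_OPCODES = 8
FEAT_DIM = NUM_OPCODES + 2  # opcode one-hot + [log output size, rank]
HIDDEN = 16

rng = np.random.default_rng(0)

def node_features(opcode, out_shape):
    """Build a toy feature vector for one op in the computation graph."""
    f = np.zeros(FEAT_DIM)
    f[opcode] = 1.0
    f[NUM_OPCODES] = np.log1p(np.prod(out_shape))  # log output element count
    f[NUM_OPCODES + 1] = len(out_shape)            # tensor rank
    return f

# Toy parameters; a real model would learn these from measured runtimes.
W_self = rng.normal(scale=0.1, size=(FEAT_DIM, HIDDEN))
W_nbr = rng.normal(scale=0.1, size=(FEAT_DIM, HIDDEN))
w_out = rng.normal(scale=0.1, size=HIDDEN)

def predict_runtime(feats, edges):
    """One round of GraphSAGE-style message passing, then a sum
    readout projected to a scalar (e.g., predicted log runtime)."""
    nbr_sum = np.zeros_like(feats)
    for src, dst in edges:        # aggregate each op's producer features
        nbr_sum[dst] += feats[src]
    h = np.maximum(feats @ W_self + nbr_sum @ W_nbr, 0.0)  # ReLU
    return float(h.sum(axis=0) @ w_out)  # graph-level prediction

# Tiny example graph: matmul -> add -> relu (opcode ids are made up).
feats = np.stack([
    node_features(0, (128, 128)),  # matmul
    node_features(1, (128, 128)),  # add
    node_features(2, (128, 128)),  # relu
])
edges = [(0, 1), (1, 2)]
print("predicted log-runtime:", predict_runtime(feats, edges))
```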
Original language: English (US)
Title of host publication: Proceedings of Machine Learning and Systems
Editors: A. Smola, A. Dimakis, I. Stoica
Pages: 387-400
Number of pages: 14
Volume: 3
State: Published - 2021
Externally published: Yes
