TrIMS: Transparent and isolated model sharing for low latency deep learning inference in function-as-a-service

Abdul Dakkak, Cheng Li, Simon Garcia De Gonzalo, Jinjun Xiong, Wen-Mei W Hwu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Deep neural networks (DNNs) have become core computation components within low-latency Function-as-a-Service (FaaS) prediction pipelines. Cloud computing, as the de facto backbone of modern computing infrastructure, has to be able to handle user-defined FaaS pipelines containing diverse DNN inference workloads while maintaining isolation and latency guarantees with minimal resource waste. The current solution for guaranteeing isolation and latency within FaaS is inefficient. A major cause of the inefficiency is the need to move large amounts of data within and across servers. We propose TrIMS as a novel solution to address this issue. TrIMS is a generic memory sharing technique that enables constant data to be shared across processes or containers while still maintaining isolation between users. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy; an efficient resource management layer that provides isolation; and a succinct set of abstractions, application APIs, and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework, achieving up to 24× speedup in latency for image classification models, up to 210× speedup for large models, and up to 8× system throughput improvement.
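The core mechanism the abstract describes - keeping model weights resident in GPU memory inside a shared "model store" process so that inference processes can map them without reloading or copying - can be sketched with CUDA's inter-process memory handles (cudaIpcGetMemHandle / cudaIpcOpenMemHandle). The sketch below is a minimal illustration under that assumption, not the paper's implementation: the file-based handle exchange, file names, and sizes are invented for illustration, and CUDA IPC by itself does not provide the per-user isolation layer that TrIMS adds on top.

// trims_ipc_sketch.cu - hypothetical sketch of cross-process weight sharing via CUDA IPC.
#include <cuda_runtime.h>
#include <cstdio>
#include <cstring>
#include <cstdlib>
#include <vector>

#define CHECK(call) do { cudaError_t e = (call); if (e != cudaSuccess) { \
    fprintf(stderr, "%s failed: %s\n", #call, cudaGetErrorString(e)); exit(1); } } while (0)

int main(int argc, char** argv) {
    const bool server = (argc > 1 && strcmp(argv[1], "server") == 0);
    const size_t kWeightBytes = 64 << 20;              // stand-in for 64 MiB of model weights
    const char* kHandlePath = "/tmp/model_handle.bin"; // toy handle exchange; a real system would use a proper IPC channel

    if (server) {
        // "Model store" process: load the weights onto the GPU once and publish a handle.
        void* d_weights = nullptr;
        CHECK(cudaMalloc(&d_weights, kWeightBytes));
        std::vector<float> host(kWeightBytes / sizeof(float), 0.5f); // fake weights
        CHECK(cudaMemcpy(d_weights, host.data(), kWeightBytes, cudaMemcpyHostToDevice));

        cudaIpcMemHandle_t handle;
        CHECK(cudaIpcGetMemHandle(&handle, d_weights));
        FILE* f = fopen(kHandlePath, "wb");
        if (!f) { fprintf(stderr, "cannot write %s\n", kHandlePath); return 1; }
        fwrite(&handle, sizeof(handle), 1, f);
        fclose(f);
        printf("weights resident on GPU; press Enter to tear down\n");
        getchar(); // keep the allocation alive while clients are attached
        CHECK(cudaFree(d_weights));
    } else {
        // Inference process: map the already-resident weights instead of reloading them.
        cudaIpcMemHandle_t handle;
        FILE* f = fopen(kHandlePath, "rb");
        if (!f) { fprintf(stderr, "run the server first\n"); return 1; }
        size_t got = fread(&handle, sizeof(handle), 1, f);
        fclose(f);
        if (got != 1) { fprintf(stderr, "bad handle file\n"); return 1; }

        void* d_weights = nullptr;
        CHECK(cudaIpcOpenMemHandle(&d_weights, handle, cudaIpcMemLazyEnablePeerAccess));
        float probe;
        CHECK(cudaMemcpy(&probe, d_weights, sizeof(float), cudaMemcpyDeviceToHost));
        printf("attached shared weights without a copy; first value = %f\n", probe);
        CHECK(cudaIpcCloseMemHandle(d_weights)); // detach; the store keeps the master copy
    }
    return 0;
}

Compile with nvcc trims_ipc_sketch.cu -o trims_ipc_sketch, then run ./trims_ipc_sketch server in one terminal and ./trims_ipc_sketch client in another; both processes must use the same GPU.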

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE International Conference on Cloud Computing, CLOUD 2019 - Part of the 2019 IEEE World Congress on Services
Editors: Elisa Bertino, Carl K. Chang, Peter Chen, Ernesto Damiani, Michael Goul, Katsunori Oyama
Publisher: IEEE Computer Society
Pages: 372-382
Number of pages: 11
ISBN (Electronic): 9781728127057
DOI: https://doi.org/10.1109/CLOUD.2019.00067
State: Published - Jul 2019
Event: 12th IEEE International Conference on Cloud Computing, CLOUD 2019 - Milan, Italy
Duration: Jul 8 2019 - Jul 13 2019

Publication series

Name: IEEE International Conference on Cloud Computing, CLOUD
Volume: 2019-July
ISSN (Print): 2159-6182
ISSN (Electronic): 2159-6190

Conference

Conference: 12th IEEE International Conference on Cloud Computing, CLOUD 2019
Country: Italy
City: Milan
Period: 7/8/19 - 7/13/19

Fingerprint

  • Containers
  • Pipelines
  • Image classification
  • Cloud computing
  • Program processors
  • Servers
  • Throughput
  • Data storage equipment
  • Deep learning
  • Deep neural networks
  • Graphics processing unit

Keywords

  • Cloud
  • Inference
  • Machine Learning
  • Memory

ASJC Scopus subject areas

  • Artificial Intelligence
  • Information Systems
  • Software

Cite this

Dakkak, A., Li, C., De Gonzalo, S. G., Xiong, J., & Hwu, W-M. W. (2019). TrIMS: Transparent and isolated model sharing for low latency deep learning inference in function-as-a-service. In E. Bertino, C. K. Chang, P. Chen, E. Damiani, M. Goul, & K. Oyama (Eds.), Proceedings - 2019 IEEE International Conference on Cloud Computing, CLOUD 2019 - Part of the 2019 IEEE World Congress on Services (pp. 372-382). [8814494] (IEEE International Conference on Cloud Computing, CLOUD; Vol. 2019-July). IEEE Computer Society. https://doi.org/10.1109/CLOUD.2019.00067

@inproceedings{cb5205afdced4377b8f5b3c12a24f831,
title = "TrIMS: Transparent and isolated model sharing for low latency deep learning inference in function-as-a-service",
abstract = "Deep neural networks (DNNs) have become core computation components within low latency Function as a Service (FaaS) prediction pipelines. Cloud computing, as the defacto backbone of modern computing infrastructure, has to be able to handle user-defined FaaS pipelines containing diverse DNN inference workloads while maintaining isolation and latency guarantees with minimal resource waste. The current solution for guaranteeing isolation and latency within FaaS is inefficient. A major cause of the inefficiency is the need to move large amount of data within and across servers. We propose TrIMS as a novel solution to address this issue. TrIMSis a generic memory sharing technique that enables constant data to be shared across processes or containers while still maintaining isolation between users. TrIMS consists of a persistent model store across the GPU, CPU, local storage, and cloud storage hierarchy, an efficient resource management layer that provides isolation, and a succinct set of abstracts, applicationAPIs, and container technologies for easy and transparent integration with FaaS, Deep Learning (DL) frameworks, and user code. We demonstrate our solution by interfacing TrIMS with the Apache MXNet framework and demonstrate up to 24x speedup in latency for image classification models, up to 210x speedup for large models, and up to8×system throughput improvement.",
keywords = "Cloud, Inference, Machine Learning, Memory",
author = "Abdul Dakkak and Cheng Li and {De Gonzalo}, {Simon Garcia} and Jinjun Xiong and Hwu, {Wen-Mei W}",
year = "2019",
month = "7",
doi = "10.1109/CLOUD.2019.00067",
language = "English (US)",
series = "IEEE International Conference on Cloud Computing, CLOUD",
publisher = "IEEE Computer Society",
pages = "372--382",
editor = "Elisa Bertino and Chang, {Carl K.} and Peter Chen and Ernesto Damiani and Michael Goul and Katsunori Oyama",
booktitle = "Proceedings - 2019 IEEE International Conference on Cloud Computing, CLOUD 2019 - Part of the 2019 IEEE World Congress on Services",

}
