TY - GEN
T1 - Enabling Real-time AI Inference on Mobile Devices via GPU-CPU Collaborative Execution
AU - Li, Hao
AU - Ng, Joseph K.
AU - Abdelzaher, Tarek
N1 - Publisher Copyright:
© 2022 IEEE.
PY - 2022
Y1 - 2022
N2 - AI-powered mobile applications are becoming increasingly popular due to recent advances in machine intelligence. They include, but are not limited to, mobile sensing, virtual assistants, and augmented reality. Mobile AI models, especially Deep Neural Networks (DNNs), are usually executed locally, as sensory data are collected and generated by end devices. This imposes a heavy computational burden on resource-constrained mobile phones. There is usually a set of DNN jobs with deadline constraints waiting for execution. Existing AI inference frameworks process incoming DNN jobs in sequential order, which does not optimally support mobile users' real-time interactions with AI services. In this paper, we propose a framework to achieve real-time inference by exploiting heterogeneous mobile SoCs, which contain both a CPU and a GPU. Considering the characteristics of DNN models, we optimally partition the execution between the mobile GPU and CPU. We present a dynamic programming-based approach to solve the formulated real-time DNN partitioning and scheduling problem. The proposed framework has several desirable properties: 1) computational resources on mobile devices are better utilized; 2) inference performance, measured by deadline miss rate, is optimized; 3) inference accuracy is not sacrificed. Evaluation results on an off-the-shelf mobile phone show that our proposed framework provides better real-time support for AI inference tasks on mobile platforms than several baselines.
AB - AI-powered mobile applications are becoming increasingly popular due to recent advances in machine intelligence. They include, but are not limited to, mobile sensing, virtual assistants, and augmented reality. Mobile AI models, especially Deep Neural Networks (DNNs), are usually executed locally, as sensory data are collected and generated by end devices. This imposes a heavy computational burden on resource-constrained mobile phones. There is usually a set of DNN jobs with deadline constraints waiting for execution. Existing AI inference frameworks process incoming DNN jobs in sequential order, which does not optimally support mobile users' real-time interactions with AI services. In this paper, we propose a framework to achieve real-time inference by exploiting heterogeneous mobile SoCs, which contain both a CPU and a GPU. Considering the characteristics of DNN models, we optimally partition the execution between the mobile GPU and CPU. We present a dynamic programming-based approach to solve the formulated real-time DNN partitioning and scheduling problem. The proposed framework has several desirable properties: 1) computational resources on mobile devices are better utilized; 2) inference performance, measured by deadline miss rate, is optimized; 3) inference accuracy is not sacrificed. Evaluation results on an off-the-shelf mobile phone show that our proposed framework provides better real-time support for AI inference tasks on mobile platforms than several baselines.
UR - http://www.scopus.com/inward/record.url?scp=85142073005&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85142073005&partnerID=8YFLogxK
U2 - 10.1109/RTCSA55878.2022.00027
DO - 10.1109/RTCSA55878.2022.00027
M3 - Conference contribution
AN - SCOPUS:85142073005
T3 - Proceedings - 2022 IEEE 28th International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA 2022
SP - 195
EP - 204
BT - Proceedings - 2022 IEEE 28th International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA 2022
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 28th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications, RTCSA 2022
Y2 - 23 August 2022 through 25 August 2022
ER -