TY - GEN
T1 - Planaria: Dynamic Architecture Fission for Spatial Multi-Tenant Acceleration of Deep Neural Networks
T2 - 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020
AU - Ghodrati, Soroush
AU - Ahn, Byung Hoon
AU - Kim, Joon Kyung
AU - Kinzer, Sean
AU - Yatham, Brahmendra Reddy
AU - Alla, Navateja
AU - Sharma, Hardik
AU - Alian, Mohammad
AU - Ebrahimi, Eiman
AU - Kim, Nam Sung
AU - Young, Cliff
AU - Esmaeilzadeh, Hadi
N1 - Publisher Copyright:
© 2020 IEEE Computer Society. All rights reserved.
PY - 2020/10
Y1 - 2020/10
N2 - Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly accelerator-based INFaaS (Google's TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows, merely scaling out the number of accelerators is not economically cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators due to the arms race for higher speed and efficiency. This paper sets out to explore this timely requirement of multi-tenancy through a new dimension: dynamic architecture fission. To that end, we define Planaria, which can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime. This microarchitectural capability enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration. To realize this dynamic reconfigurability, we first devise breakable omni-directional systolic arrays for DNN acceleration that allow omni-directional flow of data. Second, Planaria uses this capability and a unique organization of on-chip memory, interconnection, and compute resources to enable fission in systolic-array-based DNN accelerators. Architecture fission and its associated flexibility enable an extra degree of freedom for task scheduling that even allows breaking the accelerator with regard to the server load, DNN topology, and task priority. As such, it can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness. We compare the proposed design to PREMA [4], a recent effort that offers multi-tenancy by time-multiplexing the DNN accelerator across multiple tasks. We use the same frequency and the same amount of compute and memory resources for both accelerators. The results show significant benefits with (soft, medium, hard) QoS requirements, in throughput (7.4×, 7.2×, 12.2×), SLA satisfaction rate (45%, 15%, 16%), and fairness (2.1×, 2.3×, 1.9×).
AB - Deep Neural Networks (DNNs) have reinvigorated real-world applications that rely on learning patterns of data and are permeating into different industries and markets. Cloud infrastructure and accelerators that offer INFerence-as-a-Service (INFaaS) have become the enabler of this rather quick and invasive shift in the industry. To that end, mostly accelerator-based INFaaS (Google's TPU [1], NVIDIA T4 [2], Microsoft Brainwave [3], etc.) has become the backbone of many real-life applications. However, as the demand for such services grows, merely scaling out the number of accelerators is not economically cost-effective. Although multi-tenancy has propelled datacenter scalability, it has not been a primary factor in designing DNN accelerators due to the arms race for higher speed and efficiency. This paper sets out to explore this timely requirement of multi-tenancy through a new dimension: dynamic architecture fission. To that end, we define Planaria, which can dynamically fission (break) into multiple smaller yet full-fledged DNN engines at runtime. This microarchitectural capability enables spatially co-locating multiple DNN inference services on the same hardware, offering simultaneous multi-tenant DNN acceleration. To realize this dynamic reconfigurability, we first devise breakable omni-directional systolic arrays for DNN acceleration that allow omni-directional flow of data. Second, Planaria uses this capability and a unique organization of on-chip memory, interconnection, and compute resources to enable fission in systolic-array-based DNN accelerators. Architecture fission and its associated flexibility enable an extra degree of freedom for task scheduling that even allows breaking the accelerator with regard to the server load, DNN topology, and task priority. As such, it can simultaneously co-locate DNNs to enhance utilization, throughput, QoS, and fairness. We compare the proposed design to PREMA [4], a recent effort that offers multi-tenancy by time-multiplexing the DNN accelerator across multiple tasks. We use the same frequency and the same amount of compute and memory resources for both accelerators. The results show significant benefits with (soft, medium, hard) QoS requirements, in throughput (7.4×, 7.2×, 12.2×), SLA satisfaction rate (45%, 15%, 16%), and fairness (2.1×, 2.3×, 1.9×).
KW - Accelerators
KW - DNN
KW - DNN Acceleration
KW - Deep Neural Networks
KW - Dynamic Architecture Fission
KW - Multi-Tenancy
KW - Multi-Tenant DNN Acceleration
KW - Omni-Directional Systolic Arrays
KW - Spatial DNN Task Co-Location
UR - http://www.scopus.com/inward/record.url?scp=85097352748&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85097352748&partnerID=8YFLogxK
U2 - 10.1109/MICRO50266.2020.00062
DO - 10.1109/MICRO50266.2020.00062
M3 - Conference contribution
AN - SCOPUS:85097352748
T3 - Proceedings of the Annual International Symposium on Microarchitecture, MICRO
SP - 681
EP - 697
BT - Proceedings - 2020 53rd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2020
PB - IEEE Computer Society
Y2 - 17 October 2020 through 21 October 2020
ER -