TY - GEN
T1 - Mozart
T2 - 33rd International Conference on Parallel Architectures and Compilation Techniques, PACT 2024
AU - Suresh, Vignesh
AU - Mishra, Bakshree
AU - Jing, Ying
AU - Zhu, Zeran
AU - Jin, Naiyin
AU - Block, Charles
AU - Mantovani, Paolo
AU - Giri, Davide
AU - Zuckerman, Joseph
AU - Carloni, Luca P.
AU - Adve, Sarita V.
N1 - Publisher Copyright:
© 2024 ACM.
PY - 2024
Y1 - 2024
N2 - Resource-constrained system-on-chips (SoCs) are increasingly heterogeneous with specialized accelerators for various tasks. Acceleration taxes due to control and data movement, however, diminish end-to-end speedups from hardware acceleration. Meanwhile, emerging workloads are increasingly task-diverse with several, potentially shared, fine-grained acceleration candidates. This motivates a paradigm of parallel and disaggregated acceleration. Compared to a monolithic accelerator, disaggregation provides higher flexibility, reuse, and utilization, but at the cost of higher control and data acceleration taxes. We propose a novel SoC architecture, Mozart, that enables efficient accelerator disaggregation by leveraging shared-memory to tame control and data acceleration taxes. To address the control tax, Mozart includes a lightweight, modular, and general accelerator synchronization interface (ASI). ASI eliminates the typical CPU-centric accelerator control in favor of a decentralized, uniform synchronization interface through shared-memory. This enables accelerators to directly and transparently synchronize with each other (or CPUs) using the same shared-memory interface as CPUs. To address the data tax, Mozart leverages the Spandex-FCS heterogeneous coherence protocol, which supports decentralized data movement and per-word coherence specialization. We demonstrate the first RTL implementation of Spandex-FCS and the first evaluation of its benefits for a heterogeneous SoC with fixed-function accelerators, running real-world applications with Linux. Mozart simultaneously enables, for the first time, (1) finer-grained acceler-ation than previously possible, (2) programmable and transparent composition of fine-grained, disaggregated accelerators, (3) efficient accelerator pipelining through shared-memory and decentralization, and (4) a performance-competitive disaggregated alternative to specialized monolithic accelerators. We demonstrate these capabilities of Mozart with a comprehensive one-of-a-kind evaluation of more than 70 hardware configurations prototyped on an FPGA employing various accelerators, running real-world applications on Linux, and a scalability analysis with up to 15 accelerators. We also present an analytical performance model to understand and explore system design choices and to validate the results.
AB - Resource-constrained system-on-chips (SoCs) are increasingly heterogeneous with specialized accelerators for various tasks. Acceleration taxes due to control and data movement, however, diminish end-to-end speedups from hardware acceleration. Meanwhile, emerging workloads are increasingly task-diverse with several, potentially shared, fine-grained acceleration candidates. This motivates a paradigm of parallel and disaggregated acceleration. Compared to a monolithic accelerator, disaggregation provides higher flexibility, reuse, and utilization, but at the cost of higher control and data acceleration taxes. We propose a novel SoC architecture, Mozart, that enables efficient accelerator disaggregation by leveraging shared-memory to tame control and data acceleration taxes. To address the control tax, Mozart includes a lightweight, modular, and general accelerator synchronization interface (ASI). ASI eliminates the typical CPU-centric accelerator control in favor of a decentralized, uniform synchronization interface through shared-memory. This enables accelerators to directly and transparently synchronize with each other (or CPUs) using the same shared-memory interface as CPUs. To address the data tax, Mozart leverages the Spandex-FCS heterogeneous coherence protocol, which supports decentralized data movement and per-word coherence specialization. We demonstrate the first RTL implementation of Spandex-FCS and the first evaluation of its benefits for a heterogeneous SoC with fixed-function accelerators, running real-world applications with Linux. Mozart simultaneously enables, for the first time, (1) finer-grained acceler-ation than previously possible, (2) programmable and transparent composition of fine-grained, disaggregated accelerators, (3) efficient accelerator pipelining through shared-memory and decentralization, and (4) a performance-competitive disaggregated alternative to specialized monolithic accelerators. We demonstrate these capabilities of Mozart with a comprehensive one-of-a-kind evaluation of more than 70 hardware configurations prototyped on an FPGA employing various accelerators, running real-world applications on Linux, and a scalability analysis with up to 15 accelerators. We also present an analytical performance model to understand and explore system design choices and to validate the results.
KW - Accelerator Synchronization
KW - Cache Coherence
KW - Disaggregated Acceleration
KW - Heterogeneous Systems
KW - Shared-Memory
UR - http://www.scopus.com/inward/record.url?scp=85215517308&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85215517308&partnerID=8YFLogxK
U2 - 10.1145/3656019.3676896
DO - 10.1145/3656019.3676896
M3 - Conference contribution
AN - SCOPUS:85215517308
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 183
EP - 200
BT - PACT 2024 - Proceedings of the 2024 International Conference on Parallel Architectures and Compilation Techniques
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 13 October 2024 through 16 October 2024
ER -