TY - GEN
T1 - Hybrid latency tolerance for robust energy-efficiency on 1000-core data parallel processors
AU - Crago, Neal C.
AU - Azizi, Omid
AU - Lumetta, Steven S.
AU - Patel, Sanjay J.
PY - 2013
Y1 - 2013
AB - Currently, GPUs and data parallel processors leverage latency tolerance techniques such as multithreading and prefetching to maximize performance per Watt. However, choosing a technique that provides energy-efficiency on a wide variety of workloads is difficult, as the type of latency to tolerate, required hardware complexity, and energy consumption are directly related to application behavior. After qualitatively evaluating five commonly used latency tolerance techniques, we develop a hybrid technique utilizing multithreading and decoupled execution to maximize performance while minimizing hardware complexity and energy consumption across a wide variety of workloads. We compare our hybrid technique with the five commonly used techniques on a 1024-core data parallel processor by performing a comprehensive design space exploration, leveraging detailed performance and physical design models. By intelligently leveraging both decoupled execution and multithreading, our hybrid latency tolerance technique is able to improve energy-efficiency by 28% to 89% over any single technique on data parallel benchmarks. Compared to other combinations of latency tolerance techniques, we find that our hybrid latency tolerance technique provides the highest energy-efficiency by over 26%.
UR - http://www.scopus.com/inward/record.url?scp=84880313328&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84880313328&partnerID=8YFLogxK
U2 - 10.1109/HPCA.2013.6522327
DO - 10.1109/HPCA.2013.6522327
M3 - Conference contribution
AN - SCOPUS:84880313328
SN - 9781467355858
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 294
EP - 305
BT - 19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013
T2 - 19th IEEE International Symposium on High Performance Computer Architecture, HPCA 2013
Y2 - 23 February 2013 through 27 February 2013
ER -