TY - GEN
T1 - Exploiting GPU peak-power and performance tradeoffs through reduced effective pipeline latency
AU - Gilani, Syed Zohaib
AU - Kim, Nam Sung
AU - Schulte, Michael J.
PY - 2013/12/1
Y1 - 2013/12/1
AB - Modern GPUs share limited hardware resources, such as register files, among a large number of concurrently executing threads. For efficient resource sharing, several buffering and collision avoidance stages are inserted in the GPU pipeline. These additional stages increase the read-after-write (RAW) latencies of instructions. Since GPUs are often architected to hide RAW latencies through extensive multithreading, they typically do not employ power-hungry data-forwarding networks (DFNs). However, we observe that many GPGPU applications do not have enough active threads that are ready to issue instructions to hide these RAW latencies. In this paper, we first demonstrate that DFNs can considerably improve the performance of many compute-intensive GPGPU applications and then propose most recent result forwarding (MoRF) as a low-power alternative to the DFN. Second, for floating-point (FP) operations, we exploit a high-throughput fused multiply-add (HFMA) unit to further reduce both RAW latencies and the number of FMA units in the GPU without impacting instruction throughput. MoRF and HFMA together provide a geometric mean performance improvement of 18% and 29% for integer/single-precision and double-precision GPGPU applications, respectively. Finally, both MoRF and HFMA allow the GPU to effectively mimic a shallower pipeline for a large percentage of instructions. Exploiting such a benefit, we propose low-power pipelines that can reduce peak power consumption by 14% without affecting the performance or increasing the complexity of the forwarding network. The peak power reduction allows GPUs to operate more cores within the same power budget, achieving a geometric mean performance improvement of 33% for double-precision GPGPU applications.
KW - GPUs
KW - low-power
KW - pipeline latencies
UR - http://www.scopus.com/inward/record.url?scp=84892495332&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84892495332&partnerID=8YFLogxK
U2 - 10.1145/2540708.2540716
DO - 10.1145/2540708.2540716
M3 - Conference contribution
AN - SCOPUS:84892495332
SN - 9781450326384
T3 - MICRO 2013 - Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
SP - 74
EP - 85
BT - MICRO 2013 - Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture
T2 - 46th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 2013
Y2 - 7 December 2013 through 11 December 2013
ER -