Architecting waferscale processors-A GPU case study

Saptadeep Pal, Daniel Petrisko, Matthew Tomei, Puneet Gupta, Subramanian S. Iyer, Rakesh Kumar

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

Abstract

Increasing communication overheads are already threatening computer system scaling. One approach to dramatically reduce communication overheads is waferscale processing. However, waferscale processors [1], [2], [3] have been historically deemed impractical due to yield issues [1], [4] inherent to conventional integration technology. Emerging integration technologies such as Silicon-Interconnection Fabric (Si-IF) [5], [6], [7], where pre-manufactured dies are directly bonded onto a silicon wafer, may enable one to build a waferscale system without the corresponding yield issues. As such, waferscale architectures need to be revisited. In this paper, we study whether it is feasible and useful to build today's architectures at waferscale. Using a waferscale GPU as a case study, we show that while a 300 mm wafer can house about 100 GPU modules (GPMs), only a much scaled-down GPU architecture with about 40 GPMs can be built when physical concerns are considered. We also study the performance and energy implications of waferscale architectures. We show that waferscale GPUs can provide significant performance and energy efficiency advantages (up to 18.9x speedup and 143x EDP benefit compared against an equivalent MCM-GPU-based implementation on PCB) without any change in the programming model. We also develop thread scheduling and data placement policies for waferscale GPU architectures. Our policies outperform state-of-the-art scheduling and data placement policies by up to 2.88x (average 1.4x) and 1.62x (average 1.11x) for the 24-GPM and 40-GPM cases, respectively. Finally, we build the first Si-IF prototype with interconnected dies. We observe 100% of the inter-die interconnects to be successfully connected in our prototype. Coupled with the high yield reported previously for bonding of dies on Si-IF, this demonstrates the technological readiness for building a waferscale GPU architecture.
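
For readers unfamiliar with the metric, EDP (energy-delay product) is conventionally defined as the energy consumed multiplied by the execution time. The sketch below relates the two headline ratios under that standard definition. The symbols E and T and the subscripts MCM and WS are notation introduced here, not taken from the paper, and the last line additionally assumes that both "up to" figures come from the same workload and configuration, which the abstract does not state.

```latex
\begin{align*}
\mathrm{EDP} &= E \cdot T \\
\frac{\mathrm{EDP}_{\mathrm{MCM}}}{\mathrm{EDP}_{\mathrm{WS}}}
  &= \frac{E_{\mathrm{MCM}}}{E_{\mathrm{WS}}} \cdot \frac{T_{\mathrm{MCM}}}{T_{\mathrm{WS}}} \\
\frac{E_{\mathrm{MCM}}}{E_{\mathrm{WS}}}
  &\approx \frac{143}{18.9} \approx 7.6
  \quad \text{(only if both maxima coincide)}
\end{align*}
```

Read this way, the EDP benefit would combine the reported speedup with a separate reduction in energy; if the two maxima come from different workloads, the energy ratio cannot be recovered from the abstract alone.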

Original language: English (US)
Title of host publication: Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 250-263
Number of pages: 14
ISBN (Electronic): 9781728114446
DOI: https://doi.org/10.1109/HPCA.2019.00042
State: Published - Mar 26 2019
Event: 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019 - Washington, United States
Duration: Feb 16 2019 to Feb 20 2019

Publication series

Name: Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019

Conference

Conference: 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019
Country: United States
City: Washington
Period: 2/16/19 to 2/20/19

Fingerprint

Silicon
Scheduling
Graphics processing unit
Communication
Multi-chip modules
Printed circuit boards
Silicon wafers
Energy efficiency
Computer systems
Processing

Keywords

  • GPU
  • Silicon Interconnect Fabric
  • Waferscale Processors

ASJC Scopus subject areas

  • Software
  • Hardware and Architecture
  • Computer Networks and Communications

Cite this

Pal, S., Petrisko, D., Tomei, M., Gupta, P., Iyer, S. S., & Kumar, R. (2019). Architecting waferscale processors-A GPU case study. In Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019 (pp. 250-263). [8675211] (Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/HPCA.2019.00042

Architecting waferscale processors-A GPU case study. / Pal, Saptadeep; Petrisko, Daniel; Tomei, Matthew; Gupta, Puneet; Iyer, Subramanian S.; Kumar, Rakesh.

Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019. Institute of Electrical and Electronics Engineers Inc., 2019. p. 250-263 8675211 (Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019).

Pal, S, Petrisko, D, Tomei, M, Gupta, P, Iyer, SS & Kumar, R 2019, Architecting waferscale processors-A GPU case study. in Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, 8675211, Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, Institute of Electrical and Electronics Engineers Inc., pp. 250-263, 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019, Washington, United States, 2/16/19. https://doi.org/10.1109/HPCA.2019.00042
Pal S, Petrisko D, Tomei M, Gupta P, Iyer SS, Kumar R. Architecting waferscale processors-A GPU case study. In Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019. Institute of Electrical and Electronics Engineers Inc. 2019. p. 250-263. 8675211. (Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019). https://doi.org/10.1109/HPCA.2019.00042
Pal, Saptadeep ; Petrisko, Daniel ; Tomei, Matthew ; Gupta, Puneet ; Iyer, Subramanian S. ; Kumar, Rakesh. / Architecting waferscale processors-A GPU case study. Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019. Institute of Electrical and Electronics Engineers Inc., 2019. pp. 250-263 (Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019).
@inproceedings{45664786d22e421eb645b6fc7434e7a9,
title = "Architecting waferscale processors-A GPU case study",
keywords = "GPU, Silicon Interconnect Fabric, Waferscale Processors",
author = "Saptadeep Pal and Daniel Petrisko and Matthew Tomei and Puneet Gupta and Iyer, {Subramanian S.} and Rakesh Kumar",
year = "2019",
month = "3",
day = "26",
doi = "10.1109/HPCA.2019.00042",
language = "English (US)",
series = "Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019",
publisher = "Institute of Electrical and Electronics Engineers Inc.",
pages = "250--263",
booktitle = "Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019",
address = "United States",

}

TY - GEN

T1 - Architecting waferscale processors-A GPU case study

AU - Pal, Saptadeep

AU - Petrisko, Daniel

AU - Tomei, Matthew

AU - Gupta, Puneet

AU - Iyer, Subramanian S.

AU - Kumar, Rakesh

PY - 2019/3/26

Y1 - 2019/3/26

KW - GPU

KW - Silicon Interconnect Fabric

KW - Waferscale Processors

UR - http://www.scopus.com/inward/record.url?scp=85064210105&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=85064210105&partnerID=8YFLogxK

U2 - 10.1109/HPCA.2019.00042

DO - 10.1109/HPCA.2019.00042

M3 - Conference contribution

AN - SCOPUS:85064210105

T3 - Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019

SP - 250

EP - 263

BT - Proceedings - 25th IEEE International Symposium on High Performance Computer Architecture, HPCA 2019

PB - Institute of Electrical and Electronics Engineers Inc.

ER -