TY - GEN
T1 - Design trade-offs in high-throughput coherence controllers
AU - Nguyen, A. T.
AU - Torrellas, J.
N1 - Publisher Copyright: © 2003 IEEE.
PY - 2003
Y1 - 2003
AB - Recent research shows that the high occupancy of coherence controllers (CCs) is a major performance bottleneck in scalable shared-memory multiprocessors. We propose to take microarchitectural enhancements used for microprocessors and apply them to improve the throughput of hardwired CCs. These enhancements are CC support for nonblocking execution, early fetches of directory and L3 information, and superpipelining. Nonblocking execution in the CC reduces stalls by processing subsequent coherence transactions in the presence of misses in the directory cache and tag cache. Early fetching in the CC hides misses in the directory and tag caches and, therefore, also removes stalls. Finally, superpipelining in the CC increases its processing bandwidth. These supports all serve to increase the overall throughput of CCs and improve overall system performance. Using both SPLASH-2 and parallelized SPEC95 applications on detailed simulation models, we show that CCs that support nonblocking execution and superpipelining boost the performance of machines substantially. With these CCs, a 64-processor machine with four nodes of four SMPs per node runs on average 3.56 times faster than if it used conventional CCs. In addition, the machine runs about as fast as a more costly 64-processor machine with sixteen nodes of one SMP per node and the same advanced CCs. This is despite using much less network, chassis, and node hardware. Consequently, with our proposed advanced CCs, we can reduce the system cost significantly without affecting performance.
UR - http://www.scopus.com/inward/record.url?scp=84968911254&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84968911254&partnerID=8YFLogxK
U2 - 10.1109/PACT.2003.1238015
DO - 10.1109/PACT.2003.1238015
M3 - Conference contribution
AN - SCOPUS:84968911254
T3 - Parallel Architectures and Compilation Techniques - Conference Proceedings, PACT
SP - 194
EP - 205
BT - Proceedings - 12th International Conference on Parallel Architectures and Compilation Techniques, PACT 2003
PB - Institute of Electrical and Electronics Engineers Inc.
T2 - 12th International Conference on Parallel Architectures and Compilation Techniques, PACT 2003
Y2 - 27 September 2003 through 1 October 2003
ER -