TY - JOUR
T1 - The impact of exploiting instruction-level parallelism on shared-memory multiprocessors
AU - Pai, Vijay S.
AU - Ranganathan, Parthasarathy
AU - Abdel-Shafi, Hazim
AU - Adve, Sarita
N1 - Funding Information:
This work is supported in part by an IBM Partnership Award, Intel Corp., the U.S. National Science Foundation under Grant Nos. CCR-9410457, CCR-9502500, CDA-9502791, and CDA-9617383, and the Texas Advanced Technology Program under Grant No. 003604-025. Sarita Adve is also supported by an Alfred P. Sloan Research Fellowship, Vijay S. Pai by a Fannie and John Hertz Foundation Fellowship, and Parthasarathy Ranganathan by a Lodieska Stockbridge Vaughan Fellowship.
Funding Information:
BTech degree from the Indian Institute of Technology, Madras, in 1994 and his MS degree from Rice University in 1997. He is currently a doctoral candidate in the Department of Electrical and Computer Engineering at Rice University. His broad research areas are in high-performance computer architecture and performance evaluation. He is a primary developer and maintainer of the publicly distributed Rice Simulator for ILP Multiprocessors (RSIM) infrastructure. He is currently working on developing cost-effective high-performance uniprocessor and multiprocessor systems for commercial database and multimedia applications. He is a student member of the ACM and the IEEE Computer Society, a member of Eta Kappa Nu, and a recipient of the Lodieska Stockbridge Vaughan fellowship.
Funding Information:
Sarita Adve received a BTech degree in electrical engineering from the Indian Institute of Technology-Bombay in 1987, and the MS and PhD degrees in computer science from the University of Wisconsin-Madison in 1989 and 1993, respectively. She is currently an assistant professor in the Department of Electrical and Computer Engineering at Rice University. Her research interests are in computer architecture, parallel computing, and performance evaluation methods. She received a U.S. National Science Foundation CAREER award in 1995, an IBM University Partnership award in 1997 and 1998, and an Alfred P. Sloan Research Fellowship in 1998. She is an associate editor for the ACM Transactions on Modeling and Computer Simulation and has served on several conference program committees.
PY - 1999
Y1 - 1999
N2 - Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention. Our results suggest the need for additional latency-hiding or latency-reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.
AB - Current microprocessors incorporate techniques to aggressively exploit instruction-level parallelism (ILP). This paper evaluates the impact of such processors on the performance of shared-memory multiprocessors, both without and with the latency-hiding optimization of software prefetching. Our results show that, while ILP techniques substantially reduce CPU time in multiprocessors, they are less effective in removing memory stall time. Consequently, despite the inherent latency tolerance features of ILP processors, we find memory system performance to be a larger bottleneck and parallel efficiencies to be generally poorer in ILP-based multiprocessors than in previous-generation multiprocessors. The main reasons for these deficiencies are insufficient opportunities in the applications to overlap multiple load misses and increased contention for resources in the system. We also find that software prefetching does not change the memory-bound nature of most of our applications on our ILP multiprocessor, mainly due to a large number of late prefetches and resource contention. Our results suggest the need for additional latency-hiding or latency-reducing techniques for ILP systems, such as software clustering of load misses and producer-initiated communication.
KW - Instruction-level parallelism
KW - Performance evaluation
KW - Shared-memory multiprocessors
KW - Software prefetching
UR - http://www.scopus.com/inward/record.url?scp=0033075416&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0033075416&partnerID=8YFLogxK
U2 - 10.1109/12.752663
DO - 10.1109/12.752663
M3 - Article
AN - SCOPUS:0033075416
SN - 0018-9340
VL - 48
SP - 218
EP - 226
JO - IEEE Transactions on Computers
JF - IEEE Transactions on Computers
IS - 2
ER -