TY - JOUR
T1 - An experimental evaluation of the HP V-class and SGI Origin 2000 multiprocessors using microbenchmarks and scientific applications
AU - Iyer, Ravi
AU - Perdue, Jack
AU - Rauchwerger, Lawrence
AU - Amato, Nancy M.
AU - Bhuyan, Laxmi
N1 - Funding Information:
1Intel Corporation. E-mail: [email protected] 2Parasol Laboratory, Department of Computer Science, Texas A&M University, College Station, TX 77843-3112, USA. E-mail: {jkp2866, amato, rwerger}@cs.tamu.edu 3Department of Computer Science and Engineering, University of California Riverside, Riverside, CA 92521, USA. E-mail: [email protected] 4To whom correspondence should be addressed. *A preliminary version of this paper appeared in the 13th ACM International Conference on Supercomputing (ICS’99).(13) This work was done while Iyer and Bhuyan were at Texas A&M. It was supported in part by a Hewlett-Packard Equipment Grant. Amato and Rauchwerger were supported in part by NSF Grants ACI-9872126, EIA-9975018, EIA-0103742, EIA-9805823, ACR-0081510, ACR-0113971, CCR-0113974, EIA-9810937, EIA-0079874, by the DOE ASCI ASAP program, and by the Texas Higher Education Coordinating Board grant ATP-000512-0261-2001. Perdue was supported in part by a Dept. of Education Graduate Fellowship (GAANN).
PY - 2005/8
Y1 - 2005/8
N2 - As processor technology continues to advance at a rapid pace, the principal performance bottleneck of shared memory systems has become the memory access latency. In order to understand the effects of cache and memory hierarchy on system latencies, performance analysts perform benchmark analysis on existing multiprocessors. In this study, we present a detailed comparison of two architectures, the HP V-Class and the SGI Origin 2000. Our goal is to compare and contrast design techniques used in these multiprocessors. We present the impact of processor design, cache/memory hierarchies and coherence protocol optimizations on the memory system performance of these multiprocessors. We also study the effect of parallelism overheads such as process creation and synchronization on the user-level performance of these multiprocessors. Our experimental methodology uses microbenchmarks as well as scientific applications to characterize the user-level performance. Our microbenchmark results show the impact of L1/L2 cache size and TLB size on uniprocessor load/store latencies, the effect of coherence protocol design/optimizations and data sharing patterns on multiprocessor memory access latencies, and finally the overhead of parallelism. Our application-based evaluation shows the impact of problem size, dominant sharing patterns and number of processors used on speedup and raw execution time. Finally, we use hardware counter measurements to study the correlation between system-level performance metrics and the application's execution time.
AB - As processor technology continues to advance at a rapid pace, the principal performance bottleneck of shared memory systems has become the memory access latency. In order to understand the effects of cache and memory hierarchy on system latencies, performance analysts perform benchmark analysis on existing multiprocessors. In this study, we present a detailed comparison of two architectures, the HP V-Class and the SGI Origin 2000. Our goal is to compare and contrast design techniques used in these multiprocessors. We present the impact of processor design, cache/memory hierarchies and coherence protocol optimizations on the memory system performance of these multiprocessors. We also study the effect of parallelism overheads such as process creation and synchronization on the user-level performance of these multiprocessors. Our experimental methodology uses microbenchmarks as well as scientific applications to characterize the user-level performance. Our microbenchmark results show the impact of L1/L2 cache size and TLB size on uniprocessor load/store latencies, the effect of coherence protocol design/optimizations and data sharing patterns on multiprocessor memory access latencies, and finally the overhead of parallelism. Our application-based evaluation shows the impact of problem size, dominant sharing patterns and number of processors used on speedup and raw execution time. Finally, we use hardware counter measurements to study the correlation between system-level performance metrics and the application's execution time.
KW - Parallel architectures
KW - Performance analysis
KW - Shared memory
UR - http://www.scopus.com/inward/record.url?scp=24144464523&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=24144464523&partnerID=8YFLogxK
U2 - 10.1007/s10766-004-1187-0
DO - 10.1007/s10766-004-1187-0
M3 - Article
AN - SCOPUS:24144464523
SN - 0885-7458
VL - 33
SP - 307
EP - 350
JO - International Journal of Parallel Programming
JF - International Journal of Parallel Programming
IS - 4
ER -