TY - JOUR
T1 - Removing architectural bottlenecks to the scalability of speculative parallelization
AU - Prvulovic, Milos
AU - Garzarén, María J.
AU - Rauchwerger, Lawrence
AU - Torrellas, Josep
PY - 2001
Y1 - 2001
N2 - Speculative thread-level parallelization is a promising way to speed up codes that compilers fail to parallelize. While several speculative parallelization schemes have been proposed for different machine sizes and types of codes, the results so far show that it is hard to deliver scalable speedups. Often, the problem is not true dependence violations, but sub-optimal architectural design. Consequently, we attempt to identify and eliminate major architectural bottlenecks that limit the scalability of speculative parallelization. The solutions that we propose are: low-complexity commit in constant time to eliminate the task commit bottleneck, a memory-based overflow area to eliminate stall due to speculative buffer overflow, and exploiting high-level access patterns to minimize speculation-induced traffic. To show that the resulting system is truly scalable, we perform simulations with up to 128 processors. With our optimizations, the speedups for 128 and 64 processors reach 63 and 48, respectively. The average speedup for 64 processors is 32, nearly four times higher than without our optimizations.
AB - Speculative thread-level parallelization is a promising way to speed up codes that compilers fail to parallelize. While several speculative parallelization schemes have been proposed for different machine sizes and types of codes, the results so far show that it is hard to deliver scalable speedups. Often, the problem is not true dependence violations, but sub-optimal architectural design. Consequently, we attempt to identify and eliminate major architectural bottlenecks that limit the scalability of speculative parallelization. The solutions that we propose are: low-complexity commit in constant time to eliminate the task commit bottleneck, a memory-based overflow area to eliminate stall due to speculative buffer overflow, and exploiting high-level access patterns to minimize speculation-induced traffic. To show that the resulting system is truly scalable, we perform simulations with up to 128 processors. With our optimizations, the speedups for 128 and 64 processors reach 63 and 48, respectively. The average speedup for 64 processors is 32, nearly four times higher than without our optimizations.
UR - http://www.scopus.com/inward/record.url?scp=0034852757&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=0034852757&partnerID=8YFLogxK
U2 - 10.1109/ISCA.2001.937450
DO - 10.1109/ISCA.2001.937450
M3 - Article
AN - SCOPUS:0034852757
SP - 204
EP - 215
JO - Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA
JF - Conference Proceedings - Annual International Symposium on Computer Architecture, ISCA
SN - 1063-6897
ER -