TY - GEN
T1 - BulkSMT
T2 - 18th IEEE International Symposium on High Performance Computer Architecture, HPCA - 18 2012
AU - Qian, Xuehai
AU - Sahelices, Benjamin
AU - Torrellas, Josep
PY - 2012
Y1 - 2012
AB - Multiprocessor architectures that continuously execute atomic blocks (or chunks) of instructions can improve performance and software productivity. However, all prior proposals for such architectures assume single-context cores as building blocks, rather than the widely used Simultaneous Multithreading (SMT) cores. As a result, they underutilize hardware resources. This paper presents the first SMT design that supports continuous chunked (or transactional) execution of its contexts. Our design, called BulkSMT, can be used either in a single-core processor or in a multicore of SMTs. We present a set of BulkSMT configurations with different cost and performance. We also describe the architectural primitives that enable chunked execution in an SMT core and in a multicore of SMTs. Our results, based on simulations of SPLASH-2 and PARSEC codes, show that BulkSMT supports chunked execution cost-effectively. In a 4-core multicore with eager chunked execution, BulkSMT reduces the execution time of the applications by an average of 26% compared to running on single-context cores. In a single core, the average reduction is 32%.
UR - http://www.scopus.com/inward/record.url?scp=84860321705&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84860321705&partnerID=8YFLogxK
U2 - 10.1109/HPCA.2012.6168952
DO - 10.1109/HPCA.2012.6168952
M3 - Conference contribution
AN - SCOPUS:84860321705
SN - 9781467308243
T3 - Proceedings - International Symposium on High-Performance Computer Architecture
SP - 153
EP - 164
BT - Proceedings - 18th IEEE International Symposium on High Performance Computer Architecture, HPCA - 18 2012
Y2 - 25 February 2012 through 29 February 2012
ER -