TY - GEN
T1 - Dynamic Tuning of Core Counts to Maximize Performance in Object-Based Runtime Systems
AU - Chandrasekar, Kavitha
AU - Kale, Laxmikant V.
N1 - Publisher Copyright: © The Author(s), under exclusive license to Springer Nature Switzerland AG 2024.
PY - 2024
Y1 - 2024
N2 - Relatively recent developments in supercomputer nodes, such as higher physical and virtual core counts per node, aim to reduce HPC application execution time. However, not all applications benefit from increased thread-level parallelism, and some exhibit performance degradation with increased concurrency. Additionally, the best-performing thread count may not be known a priori, as it can vary across applications or with input size for a given application. This motivates dynamically tuning the number of threads or cores used by an application at run-time. However, such tuning of core counts in popular object-based or task-based runtime systems is non-trivial, since objects or tasks are anchored to processing elements (PEs) for locality. In this work, we identify the steps for adaptively tuning the core count to the most performant configuration, at run-time, for an object-based runtime system, Charm++. We show the performance benefit of dynamic profiling and adaptive selection of the core (physical or virtual) count for a variety of applications, including compute-, memory-, and cache-intensive applications. Specifically, we show that our mechanism can improve performance by almost 40% in the presence of cache and memory contention, by over 20% with SMT in Skylake nodes, and by about 35% in KNL nodes. We also show energy savings and, in some cases, power savings alongside the performance improvement.
AB - Relatively recent developments in supercomputer nodes, such as higher physical and virtual core counts per node, aim to reduce HPC application execution time. However, not all applications benefit from increased thread-level parallelism, and some exhibit performance degradation with increased concurrency. Additionally, the best-performing thread count may not be known a priori, as it can vary across applications or with input size for a given application. This motivates dynamically tuning the number of threads or cores used by an application at run-time. However, such tuning of core counts in popular object-based or task-based runtime systems is non-trivial, since objects or tasks are anchored to processing elements (PEs) for locality. In this work, we identify the steps for adaptively tuning the core count to the most performant configuration, at run-time, for an object-based runtime system, Charm++. We show the performance benefit of dynamic profiling and adaptive selection of the core (physical or virtual) count for a variety of applications, including compute-, memory-, and cache-intensive applications. Specifically, we show that our mechanism can improve performance by almost 40% in the presence of cache and memory contention, by over 20% with SMT in Skylake nodes, and by about 35% in KNL nodes. We also show energy savings and, in some cases, power savings alongside the performance improvement.
UR - http://www.scopus.com/inward/record.url?scp=85197308811&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85197308811&partnerID=8YFLogxK
U2 - 10.1007/978-3-031-61763-8_9
DO - 10.1007/978-3-031-61763-8_9
M3 - Conference contribution
AN - SCOPUS:85197308811
SN - 9783031617621
T3 - Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
SP - 92
EP - 104
BT - Asynchronous Many-Task Systems and Applications - 2nd International Workshop, WAMTA 2024, Proceedings
A2 - Diehl, Patrick
A2 - Schuchart, Joseph
A2 - Valero-Lara, Pedro
A2 - Bosilca, George
PB - Springer
T2 - 2nd International Workshop on Asynchronous Many-Task Systems and Applications, WAMTA 2024
Y2 - 14 February 2024 through 16 February 2024
ER -