TY - GEN
T1 - Work stealing and persistence-based load balancers for iterative overdecomposed applications
AU - Lifflander, Jonathan
AU - Krishnamoorthy, Sriram
AU - Kale, Laxmikant V.
PY - 2012
Y1 - 2012
AB - Applications often involve iterative execution of identical or slowly evolving calculations. Such applications require incremental rebalancing to improve load balance across iterations. In this paper, we consider the design and evaluation of two distinct approaches to addressing this challenge: persistence-based load balancing and work stealing. The work to be performed is overdecomposed into tasks, enabling automatic rebalancing by the middleware. We present a hierarchical persistence-based rebalancing algorithm that performs localized incremental rebalancing. We also present an active-message-based retentive work stealing algorithm optimized for iterative applications on distributed memory machines. We demonstrate low overheads and high efficiencies on the full NERSC Hopper (146,400 cores) and ALCF Intrepid systems (163,840 cores), and on up to 128,000 cores on OLCF Titan.
KW - Dynamic load balancing
KW - Hierarchical load balancer
KW - Iterative applications
KW - Persistence
KW - Task scheduling
KW - Work stealing
UR - http://www.scopus.com/inward/record.url?scp=84863959775&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84863959775&partnerID=8YFLogxK
U2 - 10.1145/2287076.2287103
DO - 10.1145/2287076.2287103
M3 - Conference contribution
AN - SCOPUS:84863959775
SN - 9781450308052
T3 - HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing
SP - 137
EP - 148
BT - HPDC '12 - Proceedings of the 21st ACM Symposium on High-Performance Parallel and Distributed Computing
T2 - 21st ACM Symposium on High-Performance Parallel and Distributed Computing, HPDC '12
Y2 - 18 June 2012 through 22 June 2012
ER -