TY - JOUR
T1 - Minimizing the usage of hardware counters for collective communication using triggered operations
AU - Islam, Nusrat Sharmin
AU - Zheng, Gengbin
AU - Sur, Sayantan
AU - Langer, Akhil
AU - Garzaran, Maria
N1 - This material is based upon work supported by the U.S. Department of Energy and Argonne National Laboratory and its Leadership Computing Facility under Award Number(s) DE-AC02-06CH11357. This work was generated with financial support from the U.S. Government through said Contract and Award Number(s), and as such the U.S. Government retains a paid-up, nonexclusive, irrevocable, world-wide license to reproduce, prepare derivative works, distribute copies to the public, and display publicly, by or on behalf of the Government, this work in whole or in part, or otherwise use the work for Federal purposes.
PY - 2020/10
Y1 - 2020/10
N2 - Triggered operations and counting events, or counters, are building blocks used by communication libraries such as MPI to offload collective operations to the Host Fabric Interface (HFI) or Network Interface Card (NIC). Triggered operations can be used to schedule a network or arithmetic operation to occur in the future, when a trigger counter reaches a specified threshold. On completion of the operation, the value of a completion counter increases by one. With this mechanism, it is possible to create a chain of dependent operations, so that the execution of an operation is triggered once all the operations it depends on have completed their execution. Triggered operations rely on hardware counters on the HFI, which are a limited resource. Thus, if the number of required counters exceeds the number of hardware counters, a collective needs to stall until a previous collective completes and counters are released. In addition, if the HFI has a counter cache, utilizing a large number of counters can cause cache thrashing and degrade performance. Therefore, it is important to reduce the number of counters used, especially when running on a large supercomputer or when an application uses non-blocking collectives and multiple collectives can run concurrently. Moreover, because counters are a scarce resource, it is important for the MPI library to be able to estimate the number of counters required by a collective so that it can fall back to the software implementation when the number of available counters is less than the required number. In this paper, we propose an algorithm to optimize the number of hardware counters used when offloading collectives with triggered operations. With our algorithm, different operations can share and reuse trigger and completion counters based on the dependences among them and their topological orderings. We have also proposed models to estimate the number of counters required by different collectives when using the optimization algorithm.
While the proposed counter optimization algorithm assumes that the dependences among the operations in a collective are represented using a Directed Acyclic Graph (DAG), there might be cases where no DAG is provided for the collective. In this paper, we also discuss how to optimize the usage of counters in such cases. Our experimental results show that our proposed algorithm significantly reduces the number of counters compared to a naïve approach that does not consider the dependences among the operations.
AB - Triggered operations and counting events, or counters, are building blocks used by communication libraries such as MPI to offload collective operations to the Host Fabric Interface (HFI) or Network Interface Card (NIC). Triggered operations can be used to schedule a network or arithmetic operation to occur in the future, when a trigger counter reaches a specified threshold. On completion of the operation, the value of a completion counter increases by one. With this mechanism, it is possible to create a chain of dependent operations, so that the execution of an operation is triggered once all the operations it depends on have completed their execution. Triggered operations rely on hardware counters on the HFI, which are a limited resource. Thus, if the number of required counters exceeds the number of hardware counters, a collective needs to stall until a previous collective completes and counters are released. In addition, if the HFI has a counter cache, utilizing a large number of counters can cause cache thrashing and degrade performance. Therefore, it is important to reduce the number of counters used, especially when running on a large supercomputer or when an application uses non-blocking collectives and multiple collectives can run concurrently. Moreover, because counters are a scarce resource, it is important for the MPI library to be able to estimate the number of counters required by a collective so that it can fall back to the software implementation when the number of available counters is less than the required number. In this paper, we propose an algorithm to optimize the number of hardware counters used when offloading collectives with triggered operations. With our algorithm, different operations can share and reuse trigger and completion counters based on the dependences among them and their topological orderings. We have also proposed models to estimate the number of counters required by different collectives when using the optimization algorithm.
While the proposed counter optimization algorithm assumes that the dependences among the operations in a collective are represented using a Directed Acyclic Graph (DAG), there might be cases where no DAG is provided for the collective. In this paper, we also discuss how to optimize the usage of counters in such cases. Our experimental results show that our proposed algorithm significantly reduces the number of counters compared to a naïve approach that does not consider the dependences among the operations.
UR - http://www.scopus.com/inward/record.url?scp=85089287548&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85089287548&partnerID=8YFLogxK
U2 - 10.1016/j.parco.2020.102636
DO - 10.1016/j.parco.2020.102636
M3 - Article
AN - SCOPUS:85089287548
SN - 0167-8191
VL - 98
JO - Parallel Computing
JF - Parallel Computing
M1 - 102636
ER -