TY - GEN
T1 - Minimizing the usage of hardware counters for collective communication using triggered operations
AU - Islam, Nusrat Sharmin
AU - Zheng, Gengbin
AU - Sur, Sayantan
AU - Langer, Akhil
AU - Garzaran, Maria
N1 - Publisher Copyright: © 2019 Association for Computing Machinery.
PY - 2019/9/11
Y1 - 2019/9/11
AB - Triggered operations and counting events (counters) are building blocks that communication libraries, such as MPI, can use to offload collective operations to the Host Fabric Interface (HFI) or Network Interface Card (NIC). A triggered operation schedules a network or arithmetic operation to occur in the future, when a trigger counter reaches a specified threshold. On completion of the operation, the value of a completion counter increases by one. With this mechanism, it is possible to create a chain of dependent operations, so that an operation is triggered only when all the operations it depends on have completed. Triggered operations rely on hardware counters on the HFI, which are a limited resource. Thus, if the number of required counters exceeds the number of hardware counters, a collective must stall until a previous collective completes and its counters are released. In addition, if the HFI has a counter cache, using a large number of counters can cause cache thrashing and degrade performance. Therefore, it is important to reduce the number of counters, especially when running on a large supercomputer or when an application uses non-blocking collectives and multiple collectives can run concurrently. In this paper, we propose an algorithm to minimize the number of hardware counters used when offloading collectives with triggered operations. With our algorithm, different operations can share and reuse trigger and completion counters based on the dependences among them and their topological ordering. Our experimental results show that the proposed algorithm significantly reduces the number of counters compared with a default approach that does not consider the dependences among the operations.
UR - http://www.scopus.com/inward/record.url?scp=85075869764&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85075869764&partnerID=8YFLogxK
U2 - 10.1145/3343211.3343222
DO - 10.1145/3343211.3343222
M3 - Conference contribution
AN - SCOPUS:85075869764
T3 - ACM International Conference Proceeding Series
BT - Proceedings of the 26th European MPI Users' Group Meeting, EuroMPI 2019
A2 - Hoefler, Torsten
A2 - Traff, Jesper Larsson
PB - Association for Computing Machinery
T2 - 26th European MPI Users' Group Meeting, EuroMPI 2019
Y2 - 11 September 2019 through 13 September 2019
ER -