TY - JOUR
T1 - Exploiting mesh structure to improve multigrid performance for saddle-point problems
AU - Spies, Lukas
AU - Olson, Luke
AU - MacLachlan, Scott
N1 - The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work used the Delta system at the National Center for Supercomputing Applications through allocation CIS230037 from the Advanced Cyberinfrastructure Coordination Ecosystem: Services & Support (ACCESS) program, which is supported by National Science Foundation grants #2138259, #2138286, #2138307, #2137603, and #2138296. The work of Scott MacLachlan was partially supported by an NSERC Discovery Grant.
PY - 2025/3
Y1 - 2025/3
AB - In recent years, solvers for finite-element discretizations of linear or linearized saddle-point problems, like the Stokes and Oseen equations, have become well established. There are two main classes of preconditioners for such systems: those based on a block-factorization approach and those based on monolithic multigrid. Both classes of preconditioners have several critical choices to be made in their composition, such as the selection of a suitable relaxation scheme for monolithic multigrid. From existing studies, some insight can be gained as to what options are preferable in low-performance computing settings, but there are very few fair comparisons of these approaches in the literature, particularly for modern architectures, such as GPUs. In this paper, we perform a comparison between a Block-Triangular preconditioner and monolithic multigrid methods with the three most common choices of relaxation scheme – Braess-Sarazin, Vanka, and Schur-Uzawa. We develop a performant Vanka relaxation algorithm for structured-grid discretizations, which takes advantage of memory efficiencies in this setting. We detail the behavior of the various CUDA kernels for the multigrid relaxation schemes and evaluate their individual arithmetic intensity, performance, and runtime. Running a preconditioned FGMRES solver for the Stokes equations with these preconditioners allows us to compare their efficiency in a practical setting. We show that monolithic multigrid can outperform Block-Triangular preconditioning, and that using Vanka or Braess-Sarazin relaxation is most efficient. Even though multigrid with Vanka relaxation exhibits reduced performance on the CPU (up to 100% slower than Braess-Sarazin), it is able to outperform Braess-Sarazin by more than 20% on the GPU, making it a competitive algorithm, especially given the high amount of algorithmic tuning needed for effective Braess-Sarazin relaxation.
KW - Braess-Sarazin
KW - GPU performance
KW - Monolithic multigrid
KW - Schur-Uzawa
KW - Vanka
KW - block-triangular preconditioners
KW - relaxation scheme
UR - http://www.scopus.com/inward/record.url?scp=105001553151&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=105001553151&partnerID=8YFLogxK
U2 - 10.1177/10943420241261989
DO - 10.1177/10943420241261989
M3 - Article
AN - SCOPUS:105001553151
SN - 1094-3420
VL - 39
SP - 211
EP - 229
JO - International Journal of High Performance Computing Applications
JF - International Journal of High Performance Computing Applications
IS - 2
ER -