TY - JOUR
T1 - QUDA programming for staggered quarks
AU - Gottlieb, Steven
AU - Shi, Guochun
AU - Torok, Aaron
AU - Kindratenko, Volodymyr
N1 - Publisher Copyright:
© Copyright owned by the author(s) under the terms of the Creative Commons Attribution-NonCommercial-ShareAlike Licence.
PY - 2010
Y1 - 2010
N2 - We have been extending the QUDA GPU code developed at Boston University to include the case of improved staggered quarks. Improved staggered quarks such as asqtad and HISQ require both first and third nearest-neighbor terms in the Dirac operator. We call the corresponding links fatlinks and longlinks. The fatlinks are not unitary, and staggered phases are included in the links, so link reconstruction techniques may either be inapplicable or require modification. A single precision inverter using compressed storage for the longlinks achieves a speed of 100 GF/s on an NVIDIA GTX 280 GPU on a 24³ × 32 lattice. In addition to the inverter code, we have code for fatlink computation, the gauge force and the fermion force. These run at 170, 186 and 107 GF/s, respectively, under conditions similar to those of the solver benchmark above. The single-GPU code is currently in production on NCSA's AC cluster for the study of electromagnetic effects. The double precision multimass solver runs at 20 GF/s, about 80% of the speed of an 8-node or 64-core job on Fermilab's jpsi cluster. The AC cluster has C1060 Tesla boards with lower memory bandwidth than the GTX 280, on which the DP inverter runs at 33 GF/s. Multi-GPU code is in development.
AB - We have been extending the QUDA GPU code developed at Boston University to include the case of improved staggered quarks. Improved staggered quarks such as asqtad and HISQ require both first and third nearest-neighbor terms in the Dirac operator. We call the corresponding links fatlinks and longlinks. The fatlinks are not unitary, and staggered phases are included in the links, so link reconstruction techniques may either be inapplicable or require modification. A single precision inverter using compressed storage for the longlinks achieves a speed of 100 GF/s on an NVIDIA GTX 280 GPU on a 24³ × 32 lattice. In addition to the inverter code, we have code for fatlink computation, the gauge force and the fermion force. These run at 170, 186 and 107 GF/s, respectively, under conditions similar to those of the solver benchmark above. The single-GPU code is currently in production on NCSA's AC cluster for the study of electromagnetic effects. The double precision multimass solver runs at 20 GF/s, about 80% of the speed of an 8-node or 64-core job on Fermilab's jpsi cluster. The AC cluster has C1060 Tesla boards with lower memory bandwidth than the GTX 280, on which the DP inverter runs at 33 GF/s. Multi-GPU code is in development.
UR - http://www.scopus.com/inward/record.url?scp=83155163592&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=83155163592&partnerID=8YFLogxK
M3 - Conference article
AN - SCOPUS:83155163592
SN - 1824-8039
VL - 105
JO - Proceedings of Science
JF - Proceedings of Science
T2 - 28th International Symposium on Lattice Field Theory, Lattice 2010
Y2 - 14 June 2010 through 19 June 2010
ER -