TY - GEN
T1 - Detecting flaky tests in probabilistic and machine learning applications
AU - Dutta, Saikat
AU - Shi, August
AU - Choudhary, Rutvik
AU - Zhang, Zhekun
AU - Jain, Aryaman
AU - Misailovic, Sasa
N1 - Publisher Copyright:
© 2020 ACM.
PY - 2020/7/18
Y1 - 2020/7/18
N2 - Probabilistic programming systems and machine learning frameworks like Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, such operations are non-deterministic. Hence, it is challenging for developers to write tests for applications that depend on such frameworks, often resulting in flaky tests - tests which fail non-deterministically when run on the same version of code. In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study the projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports/commits that deal with flaky tests, and we categorize the common causes and fixes for them. This study provides developers with useful insights on dealing with flaky tests in this domain. Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests due to assertions passing and failing in different runs on the same code. These assertions fail due to differences in the sequence of random numbers in different runs of the same test. FLASH exposes such failures, and our evaluation on 20 projects results in 11 previously-unknown flaky tests that we reported to developers.
AB - Probabilistic programming systems and machine learning frameworks like Pyro, PyMC3, TensorFlow, and PyTorch provide scalable and efficient primitives for inference and training. However, such operations are non-deterministic. Hence, it is challenging for developers to write tests for applications that depend on such frameworks, often resulting in flaky tests - tests which fail non-deterministically when run on the same version of code. In this paper, we conduct the first extensive study of flaky tests in this domain. In particular, we study the projects that depend on four frameworks: Pyro, PyMC3, TensorFlow-Probability, and PyTorch. We identify 75 bug reports/commits that deal with flaky tests, and we categorize the common causes and fixes for them. This study provides developers with useful insights on dealing with flaky tests in this domain. Motivated by our study, we develop a technique, FLASH, to systematically detect flaky tests due to assertions passing and failing in different runs on the same code. These assertions fail due to differences in the sequence of random numbers in different runs of the same test. FLASH exposes such failures, and our evaluation on 20 projects results in 11 previously-unknown flaky tests that we reported to developers.
KW - Flaky tests
KW - Machine Learning
KW - Non-Determinism
KW - Probabilistic Programming
KW - Randomness
UR - http://www.scopus.com/inward/record.url?scp=85088916641&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85088916641&partnerID=8YFLogxK
U2 - 10.1145/3395363.3397366
DO - 10.1145/3395363.3397366
M3 - Conference contribution
AN - SCOPUS:85088916641
T3 - ISSTA 2020 - Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis
SP - 211
EP - 224
BT - ISSTA 2020 - Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis
A2 - Khurshid, Sarfraz
A2 - Pasareanu, Corina S.
PB - Association for Computing Machinery
T2 - 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, ISSTA 2020
Y2 - 18 July 2020 through 22 July 2020
ER -