TY - GEN
T1 - Balancing Effectiveness and Flakiness of Non-Deterministic Machine Learning Tests
AU - Xia, Chunqiu Steven
AU - Dutta, Saikat
AU - Misailovic, Sasa
AU - Marinov, Darko
AU - Zhang, Lingming
N1 - Publisher Copyright:
© 2023 IEEE.
PY - 2023
Y1 - 2023
N2 - Testing Machine Learning (ML) projects is challenging due to inherent non-determinism of various ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices in selecting assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false positive failures in tests, they often set the bounds to be too loose, potentially causing tests to miss critical bugs. We present FASER, the first systematic approach for balancing the trade-off between the fault-detection effectiveness and flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem between these competing objectives by varying the assertion bound. FASER leverages 1) statistical methods to estimate the flakiness rate, and 2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 23 out of 87 studied tests have conservative bounds and proposes tighter assertion bounds that maximize the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers, each fixing one test, out of which 14 pull requests have already been accepted.
AB - Testing Machine Learning (ML) projects is challenging due to inherent non-determinism of various ML algorithms and the lack of reliable ways to compute reference results. Developers typically rely on their intuition when writing tests to check whether ML algorithms produce accurate results. However, this approach leads to conservative choices in selecting assertion bounds for comparing actual and expected results in test assertions. Because developers want to avoid false positive failures in tests, they often set the bounds to be too loose, potentially causing tests to miss critical bugs. We present FASER, the first systematic approach for balancing the trade-off between the fault-detection effectiveness and flakiness of non-deterministic tests by computing optimal assertion bounds. FASER frames this trade-off as an optimization problem between these competing objectives by varying the assertion bound. FASER leverages 1) statistical methods to estimate the flakiness rate, and 2) mutation testing to estimate the fault-detection effectiveness. We evaluate FASER on 87 non-deterministic tests collected from 22 popular ML projects. FASER finds that 23 out of 87 studied tests have conservative bounds and proposes tighter assertion bounds that maximize the fault-detection effectiveness of the tests while limiting flakiness. We have sent 19 pull requests to developers, each fixing one test, out of which 14 pull requests have already been accepted.
UR - http://www.scopus.com/inward/record.url?scp=85171793660&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85171793660&partnerID=8YFLogxK
U2 - 10.1109/ICSE48619.2023.00154
DO - 10.1109/ICSE48619.2023.00154
M3 - Conference contribution
AN - SCOPUS:85171793660
T3 - Proceedings - International Conference on Software Engineering
SP - 1801
EP - 1813
BT - Proceedings - 2023 IEEE/ACM 45th International Conference on Software Engineering, ICSE 2023
PB - IEEE Computer Society
T2 - 45th IEEE/ACM International Conference on Software Engineering, ICSE 2023
Y2 - 15 May 2023 through 16 May 2023
ER -