TY - GEN
T1 - Can ChatGPT Repair Non-Order-Dependent Flaky Tests?
AU - Chen, Yang
AU - Jabbarvand, Reyhaneh
N1 - Publisher Copyright:
© 2024 Copyright is held by the owner/author(s). Publication rights licensed to ACM.
PY - 2024/4/14
Y1 - 2024/4/14
N2 - Regression testing helps developers check whether the latest code changes break software functionality. Flaky tests, which can non-deterministically pass or fail on the same code version, may mislead developers, resulting in missed bugs or time spent pinpointing bugs that do not exist. Existing flakiness detection and mitigation techniques have primarily focused on general order-dependent (OD) and implementation-dependent (ID) flaky tests. Research on repairing test flakiness is also scarce: most of it has focused on repairing OD flaky tests, and only a few works have explored repairing a subcategory of non-order-dependent (NOD) flaky tests caused by asynchronous waits. As a result, there is a demand for techniques to reproduce, detect, and repair NOD flaky tests. Large language models (LLMs) have shown great effectiveness in several programming tasks. To explore the potential of LLMs in addressing NOD flakiness, this paper investigates the possibility of using ChatGPT to repair different categories of NOD flaky tests. Our comprehensive study on 118 NOD flaky tests from the IDoFT dataset shows that ChatGPT, despite being a leading LLM with notable success in multiple code generation tasks, is ineffective in repairing NOD test flakiness, even when following best practices for prompt crafting. We investigated the reasons behind ChatGPT's failure to repair NOD tests, which provided valuable insights into the next steps for advancing the field of NOD test flakiness repair.
AB - Regression testing helps developers check whether the latest code changes break software functionality. Flaky tests, which can non-deterministically pass or fail on the same code version, may mislead developers, resulting in missed bugs or time spent pinpointing bugs that do not exist. Existing flakiness detection and mitigation techniques have primarily focused on general order-dependent (OD) and implementation-dependent (ID) flaky tests. Research on repairing test flakiness is also scarce: most of it has focused on repairing OD flaky tests, and only a few works have explored repairing a subcategory of non-order-dependent (NOD) flaky tests caused by asynchronous waits. As a result, there is a demand for techniques to reproduce, detect, and repair NOD flaky tests. Large language models (LLMs) have shown great effectiveness in several programming tasks. To explore the potential of LLMs in addressing NOD flakiness, this paper investigates the possibility of using ChatGPT to repair different categories of NOD flaky tests. Our comprehensive study on 118 NOD flaky tests from the IDoFT dataset shows that ChatGPT, despite being a leading LLM with notable success in multiple code generation tasks, is ineffective in repairing NOD test flakiness, even when following best practices for prompt crafting. We investigated the reasons behind ChatGPT's failure to repair NOD tests, which provided valuable insights into the next steps for advancing the field of NOD test flakiness repair.
KW - large language models
KW - software testing
KW - test flakiness
UR - http://www.scopus.com/inward/record.url?scp=85203144515&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85203144515&partnerID=8YFLogxK
U2 - 10.1145/3643656.3643900
DO - 10.1145/3643656.3643900
M3 - Conference contribution
AN - SCOPUS:85203144515
T3 - Proceedings - 2024 IEEE/ACM International Flaky Tests Workshop, FTW 2024
SP - 22
EP - 29
BT - Proceedings - 2024 IEEE/ACM International Flaky Tests Workshop, FTW 2024
PB - Association for Computing Machinery
T2 - 1st International Flaky Tests Workshop, FTW 2024, co-located with the 46th ACM/IEEE International Conference on Software Engineering, ICSE 2024
Y2 - 14 April 2024
ER -