TY - GEN
T1 - Baldur
T2 - 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering, ESEC/FSE 2023
AU - First, Emily
AU - Rabe, Markus
AU - Ringer, Talia
AU - Brun, Yuriy
N1 - This work was performed at Google., Inc. We thank Stella Biderman, Ernest Davis, and others who provided feedback on an earlier draft of this paper. This work is supported by the Defense Advanced Research Projects Agency under grant no. DARPA HR0011-22-9-0063, and by the National Science Foundation under grant no. CCF-2210243.
PY - 2023/11/30
Y1 - 2023/11/30
N2 - Formally verifying software is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language and code and fine-tuned on proofs, to generate whole proofs at once. We then demonstrate that a model fine-tuned to repair generated proofs further increasing proving power. This paper: (1) Demonstrates that whole-proof generation using transformers is possible and is as effective but more efficient than search-based techniques. (2) Demonstrates that giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair that further improves automated proof generation. (3) Establishes, together with prior work, a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, empirically showing the effectiveness of whole-proof generation, repair, and added context. We also show that Baldur complements the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
AB - Formally verifying software is a highly desirable but labor-intensive task. Recent work has developed methods to automate formal verification using proof assistants, such as Coq and Isabelle/HOL, e.g., by training a model to predict one proof step at a time and using that model to search through the space of possible proofs. This paper introduces a new method to automate formal verification: We use large language models, trained on natural language and code and fine-tuned on proofs, to generate whole proofs at once. We then demonstrate that a model fine-tuned to repair generated proofs further increasing proving power. This paper: (1) Demonstrates that whole-proof generation using transformers is possible and is as effective but more efficient than search-based techniques. (2) Demonstrates that giving the learned model additional context, such as a prior failed proof attempt and the ensuing error message, results in proof repair that further improves automated proof generation. (3) Establishes, together with prior work, a new state of the art for fully automated proof synthesis. We reify our method in a prototype, Baldur, and evaluate it on a benchmark of 6,336 Isabelle/HOL theorems and their proofs, empirically showing the effectiveness of whole-proof generation, repair, and added context. We also show that Baldur complements the state-of-the-art tool, Thor, by automatically generating proofs for an additional 8.7% of the theorems. Together, Baldur and Thor can prove 65.7% of the theorems fully automatically. This paper paves the way for new research into using large language models for automating formal verification.
KW - Proof assistants
KW - automated formal verification
KW - large language models
KW - machine learning
KW - proof repair
KW - proof synthesis
UR - http://www.scopus.com/inward/record.url?scp=85165657273&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85165657273&partnerID=8YFLogxK
U2 - 10.1145/3611643.3616243
DO - 10.1145/3611643.3616243
M3 - Conference contribution
AN - SCOPUS:85165657273
T3 - ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
SP - 1229
EP - 1241
BT - ESEC/FSE 2023 - Proceedings of the 31st ACM Joint Meeting European Software Engineering Conference and Symposium on the Foundations of Software Engineering
A2 - Chandra, Satish
A2 - Blincoe, Kelly
A2 - Tonella, Paolo
PB - Association for Computing Machinery
Y2 - 3 December 2023 through 9 December 2023
ER -