TY - GEN
T1 - Combining Code Context and Fine-grained Code Difference for Commit Message Generation
AU - Xu, Shengbin
AU - Yao, Yuan
AU - Xu, Feng
AU - Gu, Tianxiao
AU - Tong, Hanghang
N1 - Publisher Copyright: © 2022 Association for Computing Machinery.
PY - 2022/6/11
Y1 - 2022/6/11
N2 - Generating natural language messages for source code changes is an essential task in software development and maintenance. Existing solutions mainly treat a piece of code difference as natural language, and adopt seq2seq learning to translate it into a commit message. The basic assumption of such solutions lies in the naturalness hypothesis, i.e., source code written by programming languages is to some extent similar to natural language text. However, compared with natural language, source code also bears syntactic regularities. In this paper, we propose to simultaneously model the naturalness and syntactic regularities of source code changes for commit message generation. Specifically, to model syntactic regularities, we first enlarge the input with additional context information, i.e., the code statements that have dependency with the variables in the code difference, and then extract the paths in the corresponding ASTs. Moreover, to better model code difference, we align the two versions of code before and after the committed code change at token level, and annotate their differences with fine-grained edit operations. The context and difference are simultaneously encoded in a learning framework to generate the commit messages. We collected from GitHub a large dataset containing 480 Java projects with over 160k commits, and the experimental results demonstrate the effectiveness of the proposed approach.
KW - Commit message generation
KW - code regularity
KW - software naturalness
UR - http://www.scopus.com/inward/record.url?scp=85139545690&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85139545690&partnerID=8YFLogxK
U2 - 10.1145/3545258.3545274
DO - 10.1145/3545258.3545274
M3 - Conference contribution
AN - SCOPUS:85139545690
T3 - ACM International Conference Proceeding Series
SP - 242
EP - 251
BT - 13th Asia-Pacific Symposium on Internetware, Internetware 2022 - Proceedings
PB - Association for Computing Machinery
T2 - 13th Asia-Pacific Symposium on Internetware, Internetware 2022
Y2 - 11 June 2022 through 12 June 2022
ER -