TY - GEN
T1 - Neural detection of semantic code clones via tree-based convolution
AU - Yu, Hao
AU - Lam, Wing
AU - Chen, Long
AU - Li, Ge
AU - Xie, Tao
AU - Wang, Qianxiang
PY - 2019/5
Y1 - 2019/5
AB - Code clones are similar code fragments that share the same semantics but may differ syntactically to various degrees. Detecting code clones helps reduce the cost of software maintenance and prevent faults. Various approaches to detecting code clones have been proposed over the last two decades, but few of them can detect semantic clones, i.e., code clones with dissimilar syntax. Recent research has attempted to adopt deep learning for detecting code clones, such as using a tree-based LSTM over the Abstract Syntax Tree (AST). However, such work does not fully leverage the structural information of code fragments, thereby limiting its clone-detection capability. To fully unleash the power of deep learning for detecting code clones, we propose a new approach that uses tree-based convolution to detect semantic clones by capturing both the structural information of a code fragment from its AST and lexical information from code tokens. Additionally, our approach addresses the limitation that source code has an unlimited vocabulary of tokens and models, and thus exploiting lexical information from code tokens is often ineffective when dealing with unseen tokens. In particular, we propose a new embedding technique called position-aware character embedding (PACE), which essentially treats any token as a position-weighted combination of character one-hot embeddings. Our experimental results show that our approach substantially outperforms an existing state-of-the-art approach, with increases of 0.42 and 0.15 in F1-score on two popular code-clone benchmarks (OJClone and BigCloneBench), respectively, while being more computationally efficient. Our experimental results also show that PACE enables our approach to be substantially more effective when code clones contain unseen tokens.
KW - AST
KW - Clone detection
KW - Embedding
KW - Generalization
KW - Lexical information
KW - Semantic clone
KW - Source code
KW - Structural information
KW - Token
KW - Tree-based convolution
UR - http://www.scopus.com/inward/record.url?scp=85066009947&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85066009947&partnerID=8YFLogxK
U2 - 10.1109/ICPC.2019.00021
DO - 10.1109/ICPC.2019.00021
M3 - Conference contribution
AN - SCOPUS:85066009947
T3 - IEEE International Conference on Program Comprehension
SP - 70
EP - 80
BT - Proceedings - 2019 IEEE/ACM 27th International Conference on Program Comprehension, ICPC 2019
PB - IEEE Computer Society
T2 - 27th IEEE/ACM International Conference on Program Comprehension, ICPC 2019
Y2 - 25 May 2019
ER -