Neural detection of semantic code clones via tree-based convolution

Hao Yu, Wing Lam, Long Chen, Ge Li, Tao Xie, Qianxiang Wang

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Code clones are similar code fragments that share the same semantics but may differ syntactically to various degrees. Detecting code clones helps reduce the cost of software maintenance and prevent faults. Various approaches of detecting code clones have been proposed over the last two decades, but few of them can detect semantic clones, i.e., code clones with dissimilar syntax. Recent research has attempted to adopt deep learning for detecting code clones, such as using tree-based LSTM over Abstract Syntax Tree (AST). However, it does not fully leverage the structural information of code fragments, thereby limiting its clone-detection capability. To fully unleash the power of deep learning for detecting code clones, we propose a new approach that uses tree-based convolution to detect semantic clones, by capturing both the structural information of a code fragment from its AST and lexical information from code tokens. Additionally, our approach addresses the limitation that source code has an unlimited vocabulary of tokens and models, and thus exploiting lexical information from code tokens is often ineffective when dealing with unseen tokens. Particularly, we propose a new embedding technique called position-aware character embedding (PACE), which essentially treats any token as a position-weighted combination of character one-hot embeddings. Our experimental results show that our approach substantially outperforms an existing state-of-the-art approach with an increase of 0.42 and 0.15 in F1-score on two popular code-clone benchmarks (OJClone and BigCloneBench), respectively, while being more computationally efficient. Our experimental results also show that PACE enables our approach to be substantially more effective when code clones contain unseen tokens.
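The idea behind PACE, as the abstract describes it, can be illustrated with a minimal sketch: each character becomes a one-hot vector, and the token vector is a position-weighted sum of those vectors. The abstract does not give the weighting formula or the character set, so the linearly decaying weight `(n - pos) / n`, the `ALPHABET` constant, and the function name `pace_like_embedding` below are all assumptions made for illustration, not the authors' implementation.

```python
import string

# Assumed character set; the abstract does not specify one.
ALPHABET = string.ascii_letters + string.digits + "_"
INDEX = {ch: i for i, ch in enumerate(ALPHABET)}


def pace_like_embedding(token):
    """Return a position-weighted sum of character one-hot vectors.

    Hypothetical sketch: the weight of the character at 0-based position
    `pos` in a token of length n is (n - pos) / n, so earlier characters
    count more. This weighting is an assumption, not taken from the abstract.
    """
    vec = [0.0] * len(ALPHABET)
    n = len(token)
    for pos, ch in enumerate(token):
        if ch not in INDEX:
            continue  # characters outside the assumed alphabet are skipped
        weight = (n - pos) / n
        # Adding weight * one_hot(ch) only changes the entry for ch.
        vec[INDEX[ch]] += weight
    return vec


# Tokens with the same characters in a different order get different vectors.
ab = pace_like_embedding("ab")
ba = pace_like_embedding("ba")
```

Because the vector is built from characters rather than looked up in a token vocabulary, a token never seen during training still receives a representation, which is the property the abstract attributes to PACE.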

Original language: English (US)
Title of host publication: Proceedings - 2019 IEEE/ACM 27th International Conference on Program Comprehension, ICPC 2019
Publisher: IEEE Computer Society
Pages: 70-80
Number of pages: 11
ISBN (Electronic): 9781728115191
State: Published - May 2019
Event: 27th IEEE/ACM International Conference on Program Comprehension, ICPC 2019 - Montreal, Canada
Duration: May 25 2019 → …

Publication series

Name: IEEE International Conference on Program Comprehension
Volume: 2019-May

Conference

Conference: 27th IEEE/ACM International Conference on Program Comprehension, ICPC 2019
Country: Canada
City: Montreal
Period: 5/25/19 → …

Keywords

  • AST
  • Clone detection
  • Embedding
  • Generalization
  • Lexical information
  • Semantic clone
  • Source code
  • Structural information
  • Token
  • Tree-based convolution

ASJC Scopus subject areas

  • Hardware and Architecture
  • Software
