SemRegex: A semantics-based approach for generating regular expressions from natural language specifications

Zexuan Zhong, Jiaqi Guo, Wei Yang, Jian Peng, Tao Xie, Jian Guang Lou, Ting Liu, Dongmei Zhang

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Recent research proposes syntax-based approaches to address the problem of generating programs from natural language specifications. These approaches typically train a sequence-to-sequence learning model using a syntax-based objective: maximum likelihood estimation (MLE). Such syntax-based approaches do not effectively address the goal of generating semantically correct programs, because these approaches fail to handle Program Aliasing, i.e., semantically equivalent programs may have many syntactically different forms. To address this issue, in this paper, we propose a semantics-based approach named SemRegex. SemRegex provides solutions for a subtask of the program-synthesis problem: generating regular expressions from natural language. Different from the existing syntax-based approaches, SemRegex trains the model by maximizing the expected semantic correctness of the generated regular expressions. The semantic correctness is measured using the DFA-equivalence oracle, random test cases, and distinguishing test cases. The experiments on three public datasets demonstrate the superiority of SemRegex over the existing state-of-the-art approaches.

Original languageEnglish (US)
Title of host publicationProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
EditorsEllen Riloff, David Chiang, Julia Hockenmaier, Jun'ichi Tsujii
PublisherAssociation for Computational Linguistics
Pages1608-1618
Number of pages11
ISBN (Electronic)9781948087841
StatePublished - Jan 1 2020
Event2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 - Brussels, Belgium
Duration: Oct 31 2018Nov 4 2018

Publication series

NameProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018

Conference

Conference2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018
CountryBelgium
CityBrussels
Period10/31/1811/4/18

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Information Systems

Fingerprint Dive into the research topics of 'SemRegex: A semantics-based approach for generating regular expressions from natural language specifications'. Together they form a unique fingerprint.

  • Cite this

    Zhong, Z., Guo, J., Yang, W., Peng, J., Xie, T., Lou, J. G., Liu, T., & Zhang, D. (2020). SemRegex: A semantics-based approach for generating regular expressions from natural language specifications. In E. Riloff, D. Chiang, J. Hockenmaier, & J. Tsujii (Eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018 (pp. 1608-1618). (Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, EMNLP 2018). Association for Computational Linguistics.