SyntacticDiff: Operator-based transformation for comparative text mining

Sean Massung, Chengxiang Zhai

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

We describe SyntacticDiff, a novel, general, and efficient edit-based method for transforming sequences of words given a reference text collection. These transformations can be used directly or can be employed as features to represent text data in a wide variety of text mining applications. As case studies, we apply SyntacticDiff to three quite different tasks, including grammatical error correction, student essay clustering and analysis, and native language identification, showing its benefit in each case. SyntacticDiff is completely general and can thus be potentially applied to any text data in any natural language. It is highly efficient, customizable, and able to capture syntactic differences from a reference text collection at the sentence, document, and subcollection levels. This enables both a rich translation method and feature representation for many text mining tasks that deal with word usage and syntax beyond bag-of-words.

Original language: English (US)
Title of host publication: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015
Editors: Feng Luo, Kemafor Ogan, Mohammed J. Zaki, Laura Haas, Beng Chin Ooi, Vipin Kumar, Sudarsan Rachuri, Saumyadipta Pyne, Howard Ho, Xiaohua Hu, Shipeng Yu, Morris Hui-I Hsiao, Jian Li
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 571-580
Number of pages: 10
ISBN (Electronic): 9781479999255
State: Published - Dec 22 2015
Event: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015 - Santa Clara, United States
Duration: Oct 29 2015 to Nov 1 2015

Publication series

Name: Proceedings - 2015 IEEE International Conference on Big Data, IEEE Big Data 2015

Other

Other: 3rd IEEE International Conference on Big Data, IEEE Big Data 2015
Country: United States
City: Santa Clara
Period: 10/29/15 to 11/1/15

Keywords

  • Comparative Text Mining
  • Monolingual Translation
  • Corpus Summarization
  • Text Categorization

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Computer Science Applications
  • Information Systems
  • Software
