Authorship classification: A syntactic tree mining approach

Sangkyum Kim, Hyungsul Kim, Tim Weninger, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In the past, there have been dozens of studies on automatic authorship classification, and many of these studies concluded that the writing style is one of the best indicators of original authorship. From among the hundreds of features which were developed, syntactic features were best able to reflect an author's writing style. However, due to the high computational complexity of extracting and computing syntactic features, only simple variations of basic syntactic features of function words and part-of-speech tags were considered. In this paper, we propose a novel approach to mining discriminative k-embedded-edge subtree patterns from a given set of syntactic trees that reduces the computational burden of using complex syntactic structures as a feature set. This method is shown to increase the classification accuracy. We also design a new kernel based on these features. Comprehensive experiments on real datasets of news articles and movie reviews demonstrate that our approach is reliable and more accurate than previous studies.

Original language: English (US)
Title of host publication: Proceedings of the ACM SIGKDD Workshop on Useful Patterns, UP'10, in Conjunction with the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Pages: 65-73
Number of pages: 9
State: Published - Sep 8 2010
Event: ACM SIGKDD Workshop on Useful Patterns, UP'10, in Conjunction with the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining - Washington, DC, United States
Duration: Jul 25 2010 → Jul 25 2010

Publication series

Name: Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Other

Other: ACM SIGKDD Workshop on Useful Patterns, UP'10, in Conjunction with the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Country/Territory: United States
City: Washington, DC
Period: 7/25/10 → 7/25/10

Keywords

  • Authorship classification
  • Closed pattern
  • Discriminative pattern
  • Text categorization
  • Text mining

ASJC Scopus subject areas

  • Software
  • Information Systems
