TY - GEN
T1 - Authorship classification
T2 - 34th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2011
AU - Kim, Sangkyum
AU - Kim, Hyungsul
AU - Weninger, Tim
AU - Han, Jiawei
AU - Kim, Hyun Duk
PY - 2011
Y1 - 2011
AB - In the past, there have been dozens of studies on automatic authorship classification, and many of these studies concluded that the writing style is one of the best indicators for original authorship. From among the hundreds of features which were developed, syntactic features were best able to reflect an author's writing style. However, due to the high computational complexity for extracting and computing syntactic features, only simple variations of basic syntactic features such as function words, POS (part-of-speech) tags, and rewrite rules were considered. In this paper, we propose a new feature set of k-embedded-edge subtree patterns that holds more syntactic information than previous feature sets. We also propose a novel approach to directly mining them from a given set of syntactic trees. We show that this approach reduces the computational burden of using complex syntactic structures as the feature set. Comprehensive experiments on real-world datasets demonstrate that our approach is reliable and more accurate than previous studies.
KW - Authorship attribution
KW - Authorship classification
KW - Authorship discrimination
KW - Text categorization
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=80052132341&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=80052132341&partnerID=8YFLogxK
U2 - 10.1145/2009916.2009979
DO - 10.1145/2009916.2009979
M3 - Conference contribution
AN - SCOPUS:80052132341
SN - 9781450309349
T3 - SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
SP - 455
EP - 464
BT - SIGIR'11 - Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval
PB - Association for Computing Machinery
Y2 - 24 July 2011 through 28 July 2011
ER -