TY - GEN
T1 - Authorship classification
T2 - ACM SIGKDD Workshop on Useful Patterns, UP'10, in Conjunction with the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
AU - Kim, Sangkyum
AU - Kim, Hyungsul
AU - Weninger, Tim
AU - Han, Jiawei
PY - 2010
Y1 - 2010
N2 - In the past, there have been dozens of studies on automatic authorship classification, and many of these studies concluded that the writing style is one of the best indicators of original authorship. From among the hundreds of features which were developed, syntactic features were best able to reflect an author's writing style. However, due to the high computational complexity of extracting and computing syntactic features, only simple variations of basic syntactic features of function words and part-of-speech tags were considered. In this paper, we propose a novel approach to mining discriminative k-embedded-edge subtree patterns from a given set of syntactic trees that reduces the computational burden of using complex syntactic structures as a feature set. This method is shown to increase the classification accuracy. We also design a new kernel based on these features. Comprehensive experiments on real datasets of news articles and movie reviews demonstrate that our approach is reliable and more accurate than previous studies.
KW - Authorship classification
KW - Closed pattern
KW - Discriminative pattern
KW - Text categorization
KW - Text mining
UR - http://www.scopus.com/inward/record.url?scp=77956234290&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=77956234290&partnerID=8YFLogxK
U2 - 10.1145/1816112.1816121
DO - 10.1145/1816112.1816121
M3 - Conference contribution
AN - SCOPUS:77956234290
SN - 9781450302166
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 65
EP - 73
BT - Proceedings of the ACM SIGKDD Workshop on Useful Patterns, UP'10, in Conjunction with the 16th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
Y2 - 25 July 2010
ER -