TY - GEN
T1 - Exploiting forum thread structures to improve thread clustering
AU - Pattabiraman, Kumaresh
AU - Sondhi, Parikshit
AU - Zhai, Cheng Xiang
PY - 2013
Y1 - 2013
AB - Automated clustering of threads within and across web forums will greatly benefit both users and forum administrators in efficiently seeking, managing, and integrating the huge volume of content being generated. While clustering has been studied for other types of data, little work has been done on clustering forum threads; the informal nature and special structure of forum data make it interesting to study how to effectively cluster forum threads. In this paper, we apply three state-of-the-art clustering methods (i.e., hierarchical agglomerative clustering, k-Means, and probabilistic latent semantic analysis) to cluster forum threads and study how to leverage the structure of threads to improve clustering accuracy. We propose three different methods for assigning weights to the posts in a forum thread to achieve a more accurate representation of a thread. We evaluate all the methods on data collected from three different Linux forums for both within-forum and across-forum clustering. Our results show that the state-of-the-art methods perform reasonably well for this task, but the performance can be further improved by exploiting thread structures. In particular, a parabolic weighting method that assigns higher weights to both the beginning and end posts of a thread is shown to consistently outperform a standard clustering method.
KW - Forums
KW - K-Means
KW - Text mining
KW - Thread clustering
KW - Web 2.0
UR - http://www.scopus.com/inward/record.url?scp=84886415322&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84886415322&partnerID=8YFLogxK
U2 - 10.1145/2499178.2499196
DO - 10.1145/2499178.2499196
M3 - Conference contribution
AN - SCOPUS:84886415322
SN - 9781450321075
T3 - ACM International Conference Proceeding Series
SP - 64
EP - 71
BT - International Conference on the Theory of Information Retrieval, ICTIR 2013 Proceedings
T2 - 4th International Conference on the Theory of Information Retrieval, ICTIR 2013
Y2 - 29 September 2013 through 2 October 2013
ER -