TY - JOUR
T1 - Frequency-sensitive competitive learning for scalable balanced clustering on high-dimensional hyperspheres
AU - Banerjee, Arindam
AU - Ghosh, Joydeep
N1 - Funding Information:
Manuscript received October 4, 2002. This work was supported in part by the National Science Foundation (NSF) under Grant ECS-9900353 and in part by the Intel Corporation under Research Grant 8032. The authors are with the Department of Electrical and Computer Engineering, University of Texas at Austin, Austin, TX 78712 USA (e-mail: [email protected]; [email protected]). Digital Object Identifier 10.1109/TNN.2004.824416
PY - 2004/5
Y1 - 2004/5
N2 - Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques.
AB - Competitive learning mechanisms for clustering, in general, suffer from poor performance for very high-dimensional (>1000) data because of "curse of dimensionality" effects. In applications such as document clustering, it is customary to normalize the high-dimensional input vectors to unit length, and it is sometimes also desirable to obtain balanced clusters, i.e., clusters of comparable sizes. The spherical kmeans (spkmeans) algorithm, which normalizes the cluster centers as well as the inputs, has been successfully used to cluster normalized text documents in 2000+ dimensional space. Unfortunately, like regular kmeans and its soft expectation-maximization-based version, spkmeans tends to generate extremely imbalanced clusters in high-dimensional spaces when the desired number of clusters is large (tens or more). This paper first shows that the spkmeans algorithm can be derived from a certain maximum likelihood formulation using a mixture of von Mises-Fisher distributions as the generative model, and in fact, it can be considered a batch-mode version of (normalized) competitive learning. The proposed generative model is then adapted in a principled way to yield three frequency-sensitive competitive learning variants that are applicable to static data and produce high-quality and well-balanced clusters for high-dimensional data. Like kmeans, each iteration is linear in the number of data points and in the number of clusters for all three algorithms. A frequency-sensitive algorithm to cluster streaming data is also proposed. Experimental results on clustering of high-dimensional text data sets are provided to show the effectiveness and applicability of the proposed techniques.
KW - Balanced clustering
KW - Expectation maximization (EM)
KW - Frequency-sensitive competitive learning (FSCL)
KW - High-dimensional clustering
KW - Kmeans
KW - Normalized data
KW - Scalable clustering
KW - Streaming data
KW - Text clustering
UR - http://www.scopus.com/inward/record.url?scp=2542583128&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=2542583128&partnerID=8YFLogxK
U2 - 10.1109/TNN.2004.824416
DO - 10.1109/TNN.2004.824416
M3 - Article
C2 - 15384557
AN - SCOPUS:2542583128
SN - 1045-9227
VL - 15
SP - 702
EP - 718
JO - IEEE Transactions on Neural Networks
JF - IEEE Transactions on Neural Networks
IS - 3
ER -