TY - JOUR
T1 - Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks
AU - Wang, Chenguang
AU - Song, Yangqiu
AU - Li, Haoran
AU - Zhang, Ming
AU - Han, Jiawei
N1 - Funding Information:
Acknowledgements Chenguang Wang, Haoran Li, and Ming Zhang gratefully acknowledge the support by the National Natural Science Foundation of China (NSFC Grant Nos. 61772039, 91646202 and 61472006). Yangqiu Song was supported by China 973 Fundamental R&D Program (No. 2014CB340304) and the Early Career Scheme (ECS, No. 26206717) from Research Grants Council in Hong Kong. Jiawei Han was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C-0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and Grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied by these agencies. We also thank the conference versions’ and journal version’s anonymous reviewers for their valuable comments and suggestions that help improve the quality of this manuscript.
Publisher Copyright:
© 2018, The Author(s).
PY - 2018/11/1
Y1 - 2018/11/1
N2 - A heterogeneous information network (HIN) is a general representation for many different applications, such as social networks, scholar networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity of two entities of the same type in an HIN. When using PathSim in practice, we usually need to handcraft meta-paths, which are paths over entity types rather than over entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over an HIN, and then develop a new similarity measure called KnowSim, which is an ensemble of the selected meta-paths. To address the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct an HIN for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path-based similarity.
AB - A heterogeneous information network (HIN) is a general representation for many different applications, such as social networks, scholar networks, and knowledge networks. A key development for HINs is PathSim, a meta-path-based measure of the pairwise similarity of two entities of the same type in an HIN. When using PathSim in practice, we usually need to handcraft meta-paths, which are paths over entity types rather than over entities themselves. However, finding useful meta-paths is not trivial for humans. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over an HIN, and then develop a new similarity measure called KnowSim, which is an ensemble of the selected meta-paths. To address the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct an HIN for documents. Our experiments on the 20Newsgroups and RCV1 datasets show that KnowSim yields high-quality document clustering and classification performance. We also demonstrate that the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path-based similarity.
KW - Heterogeneous information network
KW - Similarity
KW - Text categorization
UR - http://www.scopus.com/inward/record.url?scp=85049856858&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85049856858&partnerID=8YFLogxK
U2 - 10.1007/s10618-018-0581-y
DO - 10.1007/s10618-018-0581-y
M3 - Article
AN - SCOPUS:85049856858
SN - 1384-5810
VL - 32
SP - 1735
EP - 1767
JO - Data Mining and Knowledge Discovery
JF - Data Mining and Knowledge Discovery
IS - 6
ER -