Unsupervised meta-path selection for text similarity measure based on heterogeneous information networks

Chenguang Wang, Yangqiu Song, Haoran Li, Ming Zhang, Jiawei Han

Research output: Contribution to journalArticle

Abstract

Heterogeneous information network (HIN) is a general representation of many different applications, such as social networks, scholar networks, and knowledge networks. A key development of HIN is called PathSim based on meta-path, which measures the pairwise similarity of two entities in the HIN of the same type. When using PathSim in practice, we usually need to handcraft some meta-paths which are paths over entity types instead of entities themselves. However, finding useful meta-paths is not trivial to human. In this paper, we present an unsupervised meta-path selection approach to automatically find useful meta-paths over HIN, and then develop a new similarity measure called KnowSim which is an ensemble of selected meta-paths. To solve the high computational cost of enumerating all possible meta-paths, we propose to use an approximate personalized PageRank algorithm to find useful subgraphs to allocate the meta-paths. We apply KnowSim to text clustering and classification problems to demonstrate that unsupervised meta-path selection can help improve the clustering and classification results. We use Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents. Our experiments on 20Newsgroups and RCV1 datasets show that KnowSim results in impressive high-quality document clustering and classification performance. We also demonstrate the approximate personalized PageRank algorithm can efficiently and effectively compute the meta-path based similarity.

Original languageEnglish (US)
Pages (from-to)1735-1767
Number of pages33
JournalData Mining and Knowledge Discovery
Volume32
Issue number6
DOIs
StatePublished - Nov 1 2018

    Fingerprint

Keywords

  • Heterogeneous information network
  • Similarity
  • Text categorization

ASJC Scopus subject areas

  • Information Systems
  • Computer Science Applications
  • Computer Networks and Communications

Cite this