KnowSim: A document similarity measure on structured heterogeneous information networks

Chenguang Wang, Yangqiu Song, Haoran Li, Ming Zhang, Jiawei Han

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

As a fundamental task, document similarity measure has broad impact to document-based classification, clustering and ranking. Traditional approaches represent documents as bag-of-words and compute document similarities using measures like cosine, Jaccard, and dice. However, entity phrases rather than single words in documents can be critical for evaluating document relatedness. Moreover, types of entities and links between entities/words are also informative. We propose a method to represent a document as a typed heterogeneous information network (HIN), where the entities and relations are annotated with types. Multiple documents can be linked by the words and entities in the HIN. Consequently, we convert the document similarity problem to a graph distance problem. Intuitively, there could be multiple paths between a pair of documents. We propose to use the meta-path defined in HIN to compute distance between documents. Instead of burdening user to define meaningful meta paths, an automatic method is proposed to rank the meta-paths. Given the meta-paths associated with ranking scores, an HIN-based similarity measure, KnowSim, is proposed to compute document similarities. Using Freebase, a well-known world knowledge base, to conduct semantic parsing and construct HIN for documents, our experiments on 20Newsgroups and RCV1 datasets show that KnowSim generates impressive high-quality document clustering.

Original languageEnglish (US)
Title of host publicationProceedings - 15th IEEE International Conference on Data Mining, ICDM 2015
EditorsCharu Aggarwal, Zhi-Hua Zhou, Alexander Tuzhilin, Hui Xiong, Xindong Wu
PublisherInstitute of Electrical and Electronics Engineers Inc.
Pages1015-1020
Number of pages6
ISBN (Electronic)9781467395038
DOIs
StatePublished - Jan 5 2016
Event15th IEEE International Conference on Data Mining, ICDM 2015 - Atlantic City, United States
Duration: Nov 14 2015Nov 17 2015

Publication series

NameProceedings - IEEE International Conference on Data Mining, ICDM
Volume2016-January
ISSN (Print)1550-4786

Other

Other15th IEEE International Conference on Data Mining, ICDM 2015
Country/TerritoryUnited States
CityAtlantic City
Period11/14/1511/17/15

Keywords

  • Document similarity
  • Heterogeneous information network
  • Knowledge base
  • Knowledge graph
  • Structured text similarity

ASJC Scopus subject areas

  • General Engineering

Fingerprint

Dive into the research topics of 'KnowSim: A document similarity measure on structured heterogeneous information networks'. Together they form a unique fingerprint.

Cite this