Scalable topical phrase mining from text corpora

Ahmed El-Kishky, Yanglei Song, Chi Wang, Clare R. Voss, Jiawei Han

Research output: Contribution to journal › Conference article

Abstract

While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post-processing on the results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately-sized datasets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework to segment a document into single and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.

Original language: English (US)
Pages (from-to): 305-316
Number of pages: 12
Journal: Proceedings of the VLDB Endowment
Volume: 8
Issue number: 3
State: Published - Nov 2014
Event: 3rd Workshop on Spatio-Temporal Database Management, STDBM 2006, Co-located with the 32nd International Conference on Very Large Data Bases, VLDB 2006 - Seoul, Korea, Republic of
Duration: Sep 11 2006 → Sep 11 2006

ASJC Scopus subject areas

  • Computer Science (miscellaneous)
  • Computer Science (all)
