Abstract
While most topic modeling algorithms model text corpora with unigrams, human interpretation often relies on inherent grouping of terms into phrases. As such, we consider the problem of discovering topical phrases of mixed lengths. Existing work either performs post-processing on the results of unigram-based topic models, or utilizes complex n-gram-discovery topic models. These methods generally produce low-quality topical phrases or suffer from poor scalability on even moderately-sized datasets. We propose a different approach that is both computationally efficient and effective. Our solution combines a novel phrase mining framework to segment a document into single- and multi-word phrases, and a new topic model that operates on the induced document partition. Our approach discovers high-quality topical phrases with negligible extra cost to the bag-of-words topic model in a variety of datasets including research publication titles, abstracts, reviews, and news articles.
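The abstract describes a two-stage pipeline: first segment each document into single- and multi-word phrases, then run a topic model over the resulting phrase tokens. The sketch below is only a minimal illustration of that idea under simplifying assumptions; it uses a naive frequency-threshold bigram miner and gensim's off-the-shelf `LdaModel`, neither of which is the phrase mining framework or topic model proposed in the paper, and the threshold `MIN_PHRASE_COUNT` and helper names are hypothetical.

```python
# Minimal sketch, assuming a frequency-threshold bigram miner and a standard
# LDA model (not the paper's actual algorithms).
from collections import Counter
from gensim import corpora, models

MIN_PHRASE_COUNT = 2  # hypothetical threshold for keeping a bigram as a phrase

def mine_phrases(docs):
    """Collect contiguous bigrams occurring at least MIN_PHRASE_COUNT times."""
    bigrams = Counter(
        (doc[i], doc[i + 1]) for doc in docs for i in range(len(doc) - 1)
    )
    return {bg for bg, count in bigrams.items() if count >= MIN_PHRASE_COUNT}

def segment(doc, phrases):
    """Greedy left-to-right segmentation into unigrams and mined bigrams."""
    out, i = [], 0
    while i < len(doc):
        if i + 1 < len(doc) and (doc[i], doc[i + 1]) in phrases:
            out.append(doc[i] + "_" + doc[i + 1])  # treat the phrase as one token
            i += 2
        else:
            out.append(doc[i])
            i += 1
    return out

docs = [
    "information retrieval with topic models".split(),
    "scalable topic models for information retrieval".split(),
    "mining topical phrases from text corpora".split(),
    "phrase mining from large text corpora".split(),
]

# Stage 1: mine phrases and partition each document into phrase tokens.
phrases = mine_phrases(docs)
segmented = [segment(d, phrases) for d in docs]

# Stage 2: an ordinary bag-of-words topic model over the induced partition,
# so mined phrases ("information_retrieval", "text_corpora", ...) become
# first-class tokens in the learned topics.
dictionary = corpora.Dictionary(segmented)
bow = [dictionary.doc2bow(d) for d in segmented]
lda = models.LdaModel(bow, num_topics=2, id2word=dictionary, passes=10)
for topic_id, words in lda.show_topics(num_topics=2, num_words=5, formatted=False):
    print(topic_id, [w for w, _ in words])
```

Because segmentation happens before topic inference, the topic model itself stays a plain bag-of-words model over phrase tokens, which is what keeps the extra cost over unigram LDA small.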
| Original language | English (US) |
| --- | --- |
| Pages (from-to) | 305-316 |
| Number of pages | 12 |
| Journal | Proceedings of the VLDB Endowment |
| Volume | 8 |
| Issue number | 3 |
| DOIs | |
| State | Published - Nov 2014 |
ASJC Scopus subject areas
- Computer Science (miscellaneous)
- General Computer Science