Mining quality phrases from massive text corpora

Jialu Liu, Jingbo Shang, Chi Wang, Xiang Ren, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

Text data are ubiquitous and play an essential role in big data applications. However, text data are mostly unstructured. Transforming unstructured text into structured units (e.g., semantically meaningful phrases) will substantially reduce semantic ambiguity and enhance the power and efficiency at manipulating such data using database technology. Thus mining quality phrases is a critical research problem in the field of databases. In this paper, we propose a new framework that extracts quality phrases from text corpora integrated with phrasal segmentation. The framework requires only limited training but the quality of phrases so generated is close to human judgment. Moreover, the method is scalable: both computation time and required space grow linearly as corpus size increases. Our experiments on large text corpora demonstrate the quality and efficiency of the new method.

Original language: English (US)
Title of host publication: SIGMOD 2015 - Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1729-1744
Number of pages: 16
ISBN (Electronic): 9781450327589
State: Published - May 27 2015
Event: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015 - Melbourne, Australia
Duration: May 31 2015 - Jun 4 2015

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
Volume: 2015-May
ISSN (Print): 0730-8078

Other

Other: ACM SIGMOD International Conference on Management of Data, SIGMOD 2015
Country: Australia
City: Melbourne
Period: 5/31/15 - 6/4/15

ASJC Scopus subject areas

  • Software
  • Information Systems
