TY - JOUR
T1 - Automated Phrase Mining from Massive Text Corpora
AU - Shang, Jingbo
AU - Liu, Jialu
AU - Jiang, Meng
AU - Ren, Xiang
AU - Voss, Clare R.
AU - Han, Jiawei
N1 - Funding Information:
This research was sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS-1320617 and IIS 16-18481, grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov), and a Google PhD Fellowship. The views and conclusions contained in this document are those of the author(s) and should not be interpreted as representing the official policies of the U.S. Army Research Laboratory or the U.S. Government. The U.S. Government is authorized to reproduce and distribute reprints for Government purposes notwithstanding any copyright notation hereon.
Publisher Copyright:
© 2018 IEEE.
PY - 2018/10/1
Y1 - 2018/10/1
N2 - As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.
AB - As one of the fundamental tasks in text analysis, phrase mining aims at extracting quality phrases from a text corpus and has various downstream applications including information extraction/retrieval, taxonomy construction, and topic modeling. Most existing methods rely on complex, trained linguistic analyzers, and thus likely have unsatisfactory performance on text corpora of new domains and genres without extra but expensive adaption. None of the state-of-the-art models, even data-driven models, is fully automated because they require human experts for designing rules or labeling phrases. In this paper, we propose a novel framework for automated phrase mining, AutoPhrase, which supports any language as long as a general knowledge base (e.g., Wikipedia) in that language is available, while benefiting from, but not requiring, a POS tagger. Compared to the state-of-the-art methods, AutoPhrase has shown significant improvements in both effectiveness and efficiency on five real-world datasets across different domains and languages. Besides, AutoPhrase can be extended to model single-word quality phrases.
KW - Automatic phrase mining
KW - distant training
KW - multiple languages
KW - part-of-speech tag
KW - phrase mining
UR - http://www.scopus.com/inward/record.url?scp=85042876565&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85042876565&partnerID=8YFLogxK
U2 - 10.1109/TKDE.2018.2812203
DO - 10.1109/TKDE.2018.2812203
M3 - Article
C2 - 31105412
AN - SCOPUS:85042876565
SN - 1041-4347
VL - 30
SP - 1825
EP - 1837
JO - IEEE Transactions on Knowledge and Data Engineering
JF - IEEE Transactions on Knowledge and Data Engineering
IS - 10
M1 - 8306825
ER -