TY - JOUR
T1 - Mining Structures from Massive Texts by Exploring the Power of Pre-trained Language Models
AU - Zhang, Yu
AU - Zhang, Yunyi
AU - Han, Jiawei
N1 - We thank anonymous reviewers for their valuable and insightful feedback. Research was supported in part by the IBM-Illinois Discovery Accelerator Institute, US DARPA KAIROS Program No. FA8750-19-2-1004 and INCAS Program No. HR001121C0165, National Science Foundation IIS-19-56151, IIS-17-41317, and IIS 17-04532, and the Molecule Maker Lab Institute: An AI Research Institutes program supported by NSF under Award No. 2019897, and the Institute for Geospatial Understanding through an Integrative Discovery Environment (I-GUIDE) by NSF under Award No. 2118329. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily represent the views, either expressed or implied, of DARPA or the U.S. Government.
PY - 2023/3/20
Y1 - 2023/3/20
N2 - Technologies for handling massive structured or semi-structured data have been researched extensively in database communities. However, the real-world data are largely in the form of unstructured text, posing a great challenge to their management and analysis as well as their integration with semi-structured databases. Recent developments of deep learning methods and large pre-trained language models (PLMs) have revolutionized text mining and processing and shed new light on structuring massive text data and building a framework for integrated (i.e., structured and unstructured) data management and analysis. In this tutorial, we will focus on the recently developed text mining approaches empowered by PLMs that can work without relying on heavy human annotations. We will present an organized picture of how a set of weakly supervised methods explore the power of PLMs to structure text data, with the following outline: (1) an introduction to pre-trained language models that serve as new tools for our tasks, (2) mining topic structures: unsupervised and seed-guided methods for topic discovery from massive text corpora, (3) mining document structures: weakly supervised methods for text classification, (4) mining entity structures: distantly supervised and weakly supervised methods for phrase mining, named entity recognition, taxonomy construction, and structured knowledge graph construction, and (5) towards an integrated information processing paradigm.
AB - Technologies for handling massive structured or semi-structured data have been researched extensively in database communities. However, the real-world data are largely in the form of unstructured text, posing a great challenge to their management and analysis as well as their integration with semi-structured databases. Recent developments of deep learning methods and large pre-trained language models (PLMs) have revolutionized text mining and processing and shed new light on structuring massive text data and building a framework for integrated (i.e., structured and unstructured) data management and analysis. In this tutorial, we will focus on the recently developed text mining approaches empowered by PLMs that can work without relying on heavy human annotations. We will present an organized picture of how a set of weakly supervised methods explore the power of PLMs to structure text data, with the following outline: (1) an introduction to pre-trained language models that serve as new tools for our tasks, (2) mining topic structures: unsupervised and seed-guided methods for topic discovery from massive text corpora, (3) mining document structures: weakly supervised methods for text classification, (4) mining entity structures: distantly supervised and weakly supervised methods for phrase mining, named entity recognition, taxonomy construction, and structured knowledge graph construction, and (5) towards an integrated information processing paradigm.
UR - http://www.scopus.com/inward/record.url?scp=85165033832&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85165033832&partnerID=8YFLogxK
U2 - 10.48786/edbt.2023.81
DO - 10.48786/edbt.2023.81
M3 - Conference article
AN - SCOPUS:85165033832
SN - 2367-2005
VL - 26
SP - 851
EP - 854
JO - Advances in Database Technology - EDBT
JF - Advances in Database Technology - EDBT
IS - 3
T2 - 26th International Conference on Extending Database Technology, EDBT 2023
Y2 - 28 March 2023 through 31 March 2023
ER -