Mining Structures from Massive Texts by Exploring the Power of Pre-trained Language Models

Yu Zhang, Yunyi Zhang, Jiawei Han

Research output: Contribution to journal › Conference article › peer-review

Abstract

Technologies for handling massive structured or semi-structured data have been researched extensively in the database community. However, real-world data are largely in the form of unstructured text, which poses a great challenge to their management and analysis as well as to their integration with semi-structured databases. Recent developments in deep learning methods and large pre-trained language models (PLMs) have revolutionized text mining and processing, and they shed new light on structuring massive text data and building a framework for integrated (i.e., structured and unstructured) data management and analysis. In this tutorial, we focus on recently developed text mining approaches empowered by PLMs that work without relying on heavy human annotation. We present an organized picture of how a set of weakly supervised methods explore the power of PLMs to structure text data, with the following outline: (1) an introduction to pre-trained language models, which serve as new tools for our tasks; (2) mining topic structures: unsupervised and seed-guided methods for topic discovery from massive text corpora; (3) mining document structures: weakly supervised methods for text classification; (4) mining entity structures: distantly supervised and weakly supervised methods for phrase mining, named entity recognition, taxonomy construction, and structured knowledge graph construction; and (5) towards an integrated information processing paradigm.

Original language: English (US)
Pages (from-to): 851-854
Number of pages: 4
Journal: Advances in Database Technology - EDBT
Volume: 26
Issue number: 3
State: Published - Mar 20 2023
Event: 26th International Conference on Extending Database Technology, EDBT 2023 - Ioannina, Greece
Duration: Mar 28 2023 to Mar 31 2023

ASJC Scopus subject areas

  • Information Systems
  • Software
  • Computer Science Applications
