Building structured databases of factual knowledge from massive text corpora

Xiang Ren, Meng Jiang, Jingbo Shang, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

In today's computerized and information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media posts, and scientific publications to a wide range of textual information from various domains (corporate reports, advertisements, legal acts, medical reports). To turn such massive unstructured text data into structured, actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations) in the text. In this tutorial, we introduce data-driven methods for mining structured facts (i.e., entities and their relations/attributes for types of interest) from massive text corpora, to construct structured databases of factual knowledge (called StructDBs). State-of-the-art information extraction systems rely heavily on large amounts of task/corpus-specific labeled data (usually created by domain experts). In practice, the scale and efficiency of such a manual annotation process are rather limited, especially when dealing with text corpora of various kinds (domains, languages, genres). We focus on methods that are minimally supervised, domain-independent, and language-independent for timely StructDB construction across various application domains (news, social media, biomedical, business), and demonstrate on real datasets how these StructDBs aid in data exploration and knowledge discovery.

Original language: English (US)
Title of host publication: SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 1741-1745
Number of pages: 5
ISBN (Electronic): 9781450341974
State: Published - May 9 2017
Event: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017 - Chicago, United States
Duration: May 14 2017 to May 19 2017

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
Volume: Part F127746
ISSN (Print): 0730-8078

Other

Other: 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
Country/Territory: United States
City: Chicago
Period: 5/14/17 to 5/19/17

Keywords

  • Attribute discovery
  • Entity recognition and typing
  • Massive text corpora
  • Quality phrase mining
  • Relation extraction

ASJC Scopus subject areas

  • Software
  • Information Systems
