Automatic entity recognition and typing in massive text data

Xiang Ren, Ahmed El-Kishky, Heng Ji, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., person, product, and food) in a scalable way. Since these methods do not rely on annotated data, predefined typing schemas, or hand-crafted features, they can be quickly adapted to new domains, genres, and languages. We demonstrate on real datasets spanning various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. biomedical), and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.

Original language: English (US)
Title of host publication: SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
Publisher: Association for Computing Machinery
Pages: 2235-2239
Number of pages: 5
ISBN (Electronic): 9781450335317
State: Published - Jun 26 2016
Event: 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016 - San Francisco, United States
Duration: Jun 26 2016 to Jul 1 2016

Publication series

Name: Proceedings of the ACM SIGMOD International Conference on Management of Data
Volume: 26-June-2016
ISSN (Print): 0730-8078

Other

Other: 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
Country/Territory: United States
City: San Francisco
Period: 6/26/16 to 7/1/16

ASJC Scopus subject areas

  • Software
  • Information Systems
