TY - GEN
T1 - Automatic entity recognition and typing in massive text data
AU - Ren, Xiang
AU - El-Kishky, Ahmed
AU - Ji, Heng
AU - Han, Jiawei
PY - 2016/6/26
Y1 - 2016/6/26
AB - In today's computerized and information-based society, individuals are constantly presented with vast amounts of text data, ranging from news articles, scientific publications, and product reviews to a wide range of textual information from social media. To extract value from these large, multi-domain pools of text, it is of great importance to gain an understanding of entities and their relationships. In this tutorial, we introduce data-driven methods to recognize typed entities of interest in massive, domain-specific text corpora. These methods can automatically identify token spans as entity mentions in documents and label their fine-grained types (e.g., people, product, and food) in a scalable way. Since these methods do not rely on annotated data, a predefined typing schema, or hand-crafted features, they can be quickly adapted to a new domain, genre, or language. We demonstrate on real datasets spanning various genres (e.g., news articles, discussion forum posts, and tweets), domains (general vs. biomedical), and languages (e.g., English, Chinese, Arabic, and even low-resource languages like Hausa and Yoruba) how these typed entities aid in knowledge discovery and management.
UR - http://www.scopus.com/inward/record.url?scp=84979691965&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84979691965&partnerID=8YFLogxK
U2 - 10.1145/2882903.2912567
DO - 10.1145/2882903.2912567
M3 - Conference contribution
AN - SCOPUS:84979691965
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 2235
EP - 2239
BT - SIGMOD 2016 - Proceedings of the 2016 International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2016 ACM SIGMOD International Conference on Management of Data, SIGMOD 2016
Y2 - 26 June 2016 through 1 July 2016
ER -