TY - JOUR
T1 - Liberal entity extraction
T2 - Rapid construction of fine-grained entity typing systems
AU - Huang, Lifu
AU - May, Jonathan
AU - Pan, Xiaoman
AU - Ji, Heng
AU - Ren, Xiang
AU - Han, Jiawei
AU - Zhao, Lin
AU - Hendler, James A.
N1 - Funding Information:
This work was supported by the U.S. ARL NS-CTA No. W911NF-09-2-0053 and DARPA DEFT No. FA8750-13-2-0041, and in part by NSF IIS-1523198, IIS-1017362, IIS-1320617 and IIS-1354329, and NIH BD2K grant 1U54GM114838.
Publisher Copyright:
© 2017, Mary Ann Liebert, Inc.
PY - 2017/3
Y1 - 2017/3
AB - The ability to automatically recognize and type entities in natural language without prior knowledge (e.g., predefined entity types) is a major challenge in processing such data. Most existing entity typing systems are limited to certain domains, genres, and languages. In this article, we propose a novel unsupervised entity-typing framework that combines symbolic and distributional semantics. We start by learning three types of representations for each entity mention: a general semantic representation, a specific context representation, and a knowledge representation based on knowledge bases. We then develop a novel joint hierarchical clustering and linking algorithm to type all mentions using these representations. This framework does not rely on any annotated data, predefined typing schema, or handcrafted features; therefore, it can be quickly adapted to a new domain, genre, and/or language. Experiments on two genres (news and discussion forum) show performance comparable to that of state-of-the-art supervised typing systems trained on a large amount of labeled data. Results on various languages (English, Chinese, Japanese, Hausa, and Yoruba) and domains (general and biomedical) demonstrate the portability of our framework.
KW - Liberal Information Extraction
KW - fine-grained entity typing
KW - multi-level entity mention and representation
KW - unsupervised learning
UR - http://www.scopus.com/inward/record.url?scp=85016399563&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85016399563&partnerID=8YFLogxK
U2 - 10.1089/big.2017.0012
DO - 10.1089/big.2017.0012
M3 - Article
C2 - 28328252
AN - SCOPUS:85016399563
SN - 2167-6461
VL - 5
SP - 19
EP - 31
JO - Big Data
JF - Big Data
IS - 1
ER -