Abstract
Automatically recognizing and typing entities in natural language without prior knowledge (e.g., predefined entity types) is a major challenge in processing such data. Most existing entity typing systems are limited to certain domains, genres, and languages. In this article, we propose a novel unsupervised entity-typing framework that combines symbolic and distributional semantics. We start by learning three types of representations for each entity mention: a general semantic representation, a specific context representation, and a knowledge representation based on knowledge bases. Then we develop a novel joint hierarchical clustering and linking algorithm to type all mentions using these representations. This framework does not rely on any annotated data, predefined typing schema, or handcrafted features; therefore, it can be quickly adapted to a new domain, genre, and/or language. Experiments on two genres (news and discussion forum) show performance comparable to state-of-the-art supervised typing systems trained on a large amount of labeled data. Results on various languages (English, Chinese, Japanese, Hausa, and Yoruba) and domains (general and biomedical) demonstrate the portability of our framework.
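As a rough illustration of the general idea of grouping mentions by their combined representations, the following minimal sketch concatenates three toy vectors per mention and applies plain average-linkage agglomerative clustering with cosine similarity. It is not the article's joint clustering and linking algorithm: the vectors, the `combine` and `agglomerative` helpers, and the 0.5 stopping threshold are all hypothetical choices made for this example, and no knowledge-base linking is performed.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def combine(general, context, knowledge):
    """Concatenate the three per-mention representations (an assumed, simple fusion)."""
    return list(general) + list(context) + list(knowledge)


def agglomerative(vectors, threshold):
    """Average-linkage agglomerative clustering; stops when the best merge falls below threshold."""
    clusters = [[i] for i in range(len(vectors))]

    def avg_sim(a, b):
        total = sum(cosine(vectors[i], vectors[j]) for i in a for j in b)
        return total / (len(a) * len(b))

    while len(clusters) > 1:
        # Find the most similar pair of clusters (x < y by construction).
        (x, y), best = max(
            (((x, y), avg_sim(clusters[x], clusters[y]))
             for x, y in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if best < threshold:
            break
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return clusters


# Toy mentions with made-up (general, context, knowledge) vectors.
mentions = {
    "Obama":   ([1.0, 0.0], [1.0, 0.0], [1.0, 0.0]),
    "Clinton": ([0.9, 0.1], [1.0, 0.0], [1.0, 0.0]),
    "Paris":   ([0.0, 1.0], [0.0, 1.0], [0.0, 1.0]),
    "Berlin":  ([0.1, 0.9], [0.0, 1.0], [0.0, 1.0]),
}
names = list(mentions)
vectors = [combine(*mentions[name]) for name in names]
clusters = agglomerative(vectors, threshold=0.5)  # threshold is an arbitrary choice

for cluster in clusters:
    print([names[i] for i in cluster])
```

In this toy setting the person-like and place-like mentions end up in separate clusters. A real system would still need to assign a type label to each cluster, which the article does by linking to knowledge bases.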
Original language | English (US) |
---|---|
Pages (from-to) | 19-31 |
Number of pages | 13 |
Journal | Big Data |
Volume | 5 |
Issue number | 1 |
DOIs | |
State | Published - Mar 2017 |
Keywords
- Liberal Information Extraction
- fine-grained entity typing
- multi-level entity mention and representation
- unsupervised learning
ASJC Scopus subject areas
- Information Systems
- Computer Science Applications
- Information Systems and Management