TY - GEN
T1 - Building structured databases of factual knowledge from massive text corpora
AU - Ren, Xiang
AU - Jiang, Meng
AU - Shang, Jingbo
AU - Han, Jiawei
N1 - Publisher Copyright: © 2017 ACM.
PY - 2017/5/9
Y1 - 2017/5/9
N2 - In today's computerized and information-based society, people are inundated with vast amounts of text data, ranging from news articles, social media posts, and scientific publications to a wide range of textual information from various domains (corporate reports, advertisements, legal acts, medical reports). To turn such massive unstructured text data into structured, actionable knowledge, one of the grand challenges is to gain an understanding of the factual information (e.g., entities, attributes, relations) in the text. In this tutorial, we introduce data-driven methods for mining structured facts (i.e., entities and their relations/attributes for types of interest) from massive text corpora, to construct structured databases of factual knowledge (called StructDBs). State-of-the-art information extraction systems rely heavily on large amounts of task/corpus-specific labeled data (usually created by domain experts). In practice, the scale and efficiency of such a manual annotation process are rather limited, especially when dealing with text corpora of various kinds (domains, languages, genres). We focus on methods that are minimally supervised, domain-independent, and language-independent for timely StructDB construction across various application domains (news, social media, biomedical, business), and demonstrate on real datasets how these StructDBs aid in data exploration and knowledge discovery.
KW - Attribute discovery
KW - Entity recognition and typing
KW - Massive text corpora
KW - Quality phrase mining
KW - Relation extraction
UR - http://www.scopus.com/inward/record.url?scp=85021226338&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85021226338&partnerID=8YFLogxK
U2 - 10.1145/3035918.3054781
DO - 10.1145/3035918.3054781
M3 - Conference contribution
AN - SCOPUS:85021226338
T3 - Proceedings of the ACM SIGMOD International Conference on Management of Data
SP - 1741
EP - 1745
BT - SIGMOD 2017 - Proceedings of the 2017 ACM International Conference on Management of Data
PB - Association for Computing Machinery
T2 - 2017 ACM SIGMOD International Conference on Management of Data, SIGMOD 2017
Y2 - 14 May 2017 through 19 May 2017
ER -