TY - GEN
T1 - Leveraging pattern semantics for extracting entities in enterprises
AU - Tao, Fangbo
AU - Zhao, Bo
AU - Fuxman, Ariel
AU - Li, Yang
AU - Han, Jiawei
N1 - Funding Information:
National Science Foundation IIS-1017362, IIS-1320617, and IIS-1354329, HDTRA1-10-1-0120
PY - 2015/5/18
Y1 - 2015/5/18
N2 - Entity Extraction is a process of identifying meaningful entities from text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging on enterprise domains due to several reasons. First, the lack of redundancy of enterprise entities makes previous web-based systems like NELL and OpenIE not effective, since using only high-precision/low-recall patterns like those systems would miss the majority of sparse enterprise entities, while using more low-precision patterns in sparse setting also introduces noise drastically. Second, semantic drift is common in enterprises ("Blue" refers to "Windows Blue"), such that public signals from the web cannot be directly applied on entities. Moreover, many internal entities never appear on the web. Sparse internal signals are the only source for discovering them. To address these challenges, we propose an end-to-end framework for extracting entities in enterprises, taking the input of enterprise corpus and limited seeds to generate a high-quality entity collection as output. We introduce the novel concept of Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities. Experiments on Microsoft enterprise data show the effectiveness of our approach.
AB - Entity Extraction is a process of identifying meaningful entities from text documents. In enterprises, extracting entities improves enterprise efficiency by facilitating numerous applications, including search, recommendation, etc. However, the problem is particularly challenging on enterprise domains due to several reasons. First, the lack of redundancy of enterprise entities makes previous web-based systems like NELL and OpenIE not effective, since using only high-precision/low-recall patterns like those systems would miss the majority of sparse enterprise entities, while using more low-precision patterns in sparse setting also introduces noise drastically. Second, semantic drift is common in enterprises ("Blue" refers to "Windows Blue"), such that public signals from the web cannot be directly applied on entities. Moreover, many internal entities never appear on the web. Sparse internal signals are the only source for discovering them. To address these challenges, we propose an end-to-end framework for extracting entities in enterprises, taking the input of enterprise corpus and limited seeds to generate a high-quality entity collection as output. We introduce the novel concept of Semantic Pattern Graph to leverage public signals to understand the underlying semantics of lexical patterns, reinforce pattern evaluation using mined semantics, and yield more accurate and complete entities. Experiments on Microsoft enterprise data show the effectiveness of our approach.
KW - Enterprise Taxonomy
KW - Enterprise Entity Extraction
KW - Semantic Pattern Graph
UR - http://www.scopus.com/inward/record.url?scp=84968830342&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=84968830342&partnerID=8YFLogxK
U2 - 10.1145/2736277.2741670
DO - 10.1145/2736277.2741670
M3 - Conference contribution
C2 - 26705540
AN - SCOPUS:84968830342
T3 - WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
SP - 1078
EP - 1088
BT - WWW 2015 - Proceedings of the 24th International Conference on World Wide Web
PB - Association for Computing Machinery
T2 - 24th International Conference on World Wide Web, WWW 2015
Y2 - 18 May 2015 through 22 May 2015
ER -