TY - GEN
T1 - Unsupervised concept categorization and extraction from scientific document titles
AU - Krishnan, Adit
AU - Sankar, Aravind
AU - Zhi, Shi
AU - Han, Jiawei
N1 - Funding Information:
Research was sponsored in part by the U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), National Science Foundation IIS 16-18481 and NSF IIS 17-04532, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative.
Publisher Copyright:
© 2017 ACM.
PY - 2017/11/6
Y1 - 2017/11/6
N2 - This paper studies the automated categorization and extraction of scientific concepts from titles of scientific articles, in order to gain a deeper understanding of their key contributions and facilitate the construction of a generic academic knowledgebase. Towards this goal, we propose an unsupervised, domain-independent, and scalable two-phase algorithm to type and extract key concept mentions into aspects of interest (e.g., Techniques, Applications, etc.). In the first phase of our algorithm we propose PhraseType, a probabilistic generative model which exploits textual features and limited POS tags to broadly segment text snippets into aspect-typed phrases. We extend this model to simultaneously learn aspect-specific features and identify academic domains in multi-domain corpora, since the two tasks mutually enhance each other. In the second phase, we propose an approach based on adaptor grammars to extract fine-grained concept mentions from the aspect-typed phrases without the need for any external resources or human effort, in a purely data-driven manner. We apply our technique to study literature from diverse scientific domains and show significant gains over state-of-the-art concept extraction techniques. We also present a qualitative analysis of the results obtained.
AB - This paper studies the automated categorization and extraction of scientific concepts from titles of scientific articles, in order to gain a deeper understanding of their key contributions and facilitate the construction of a generic academic knowledgebase. Towards this goal, we propose an unsupervised, domain-independent, and scalable two-phase algorithm to type and extract key concept mentions into aspects of interest (e.g., Techniques, Applications, etc.). In the first phase of our algorithm we propose PhraseType, a probabilistic generative model which exploits textual features and limited POS tags to broadly segment text snippets into aspect-typed phrases. We extend this model to simultaneously learn aspect-specific features and identify academic domains in multi-domain corpora, since the two tasks mutually enhance each other. In the second phase, we propose an approach based on adaptor grammars to extract fine-grained concept mentions from the aspect-typed phrases without the need for any external resources or human effort, in a purely data-driven manner. We apply our technique to study literature from diverse scientific domains and show significant gains over state-of-the-art concept extraction techniques. We also present a qualitative analysis of the results obtained.
KW - Adaptor grammar
KW - Concept extraction
KW - Probabilistic model
UR - http://www.scopus.com/inward/record.url?scp=85037340684&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85037340684&partnerID=8YFLogxK
U2 - 10.1145/3132847.3133023
DO - 10.1145/3132847.3133023
M3 - Conference contribution
AN - SCOPUS:85037340684
T3 - International Conference on Information and Knowledge Management, Proceedings
SP - 1339
EP - 1348
BT - CIKM 2017 - Proceedings of the 2017 ACM Conference on Information and Knowledge Management
PB - Association for Computing Machinery
T2 - 26th ACM International Conference on Information and Knowledge Management, CIKM 2017
Y2 - 6 November 2017 through 10 November 2017
ER -