TY - GEN
T1 - WikiCSSH
T2 - 24th East-European Conference on Advances in Databases and Information Systems, ADBIS 2020, the 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, and the 16th Workshop on Business Intelligence and Big Data, EDA 2020
AU - Han, Kanyao
AU - Yang, Pingjing
AU - Mishra, Shubhanshu
AU - Diesner, Jana
N1 - Publisher Copyright: © 2020, Springer Nature Switzerland AG.
PY - 2020
Y1 - 2020
AB - Domain-specific classification schemas (or subject heading vocabularies) are often used to identify, classify, and disambiguate concepts that occur in scholarly articles. In this work, we develop, apply, and evaluate a human-in-the-loop workflow that first extracts an initial category tree from crowd-sourced Wikipedia data, and then combines community detection, machine learning, and hand-crafted heuristics or rules to prune the initial tree. This work resulted in WikiCSSH, a large-scale, hierarchically-organized subject heading vocabulary for the domain of computer science (CS). Our evaluation suggests that WikiCSSH outperforms alternative CS vocabularies in terms of coverage of CS terms that occur in research articles. WikiCSSH can further distinguish between coarse-grained and fine-grained CS concepts. The outlined workflow can serve as a template for building hierarchically-organized subject heading vocabularies for other domains that are covered in Wikipedia.
KW - Computer science
KW - Hierarchical vocabulary
KW - Wikipedia
UR - http://www.scopus.com/inward/record.url?scp=85090098225&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090098225&partnerID=8YFLogxK
U2 - 10.1007/978-3-030-55814-7_17
DO - 10.1007/978-3-030-55814-7_17
M3 - Conference contribution
AN - SCOPUS:85090098225
SN - 9783030558130
T3 - Communications in Computer and Information Science
SP - 207
EP - 218
BT - ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium - International Workshops
A2 - Bellatreche, Ladjel
A2 - Bieliková, Mária
A2 - Boussaïd, Omar
A2 - Darmont, Jérôme
A2 - Catania, Barbara
A2 - Demidova, Elena
A2 - Duchateau, Fabien
A2 - Hall, Mark
A2 - Mercun, Tanja
A2 - Žumer, Maja
A2 - Novikov, Boris
A2 - Papatheodorou, Christos
A2 - Risse, Thomas
A2 - Romero, Oscar
A2 - Sautot, Lucile
A2 - Talens, Guilaine
A2 - Wrembel, Robert
PB - Springer
Y2 - 25 August 2020 through 27 August 2020
ER -