WikiCSSH: Extracting Computer Science Subject Headings from Wikipedia

Kanyao Han, Pingjing Yang, Shubhanshu Mishra, Jana Diesner

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Domain-specific classification schemas (or subject heading vocabularies) are often used to identify, classify, and disambiguate concepts that occur in scholarly articles. In this work, we develop, apply, and evaluate a human-in-the-loop workflow that first extracts an initial category tree from crowd-sourced Wikipedia data, and then combines community detection, machine learning, and hand-crafted heuristics or rules to prune the initial tree. This work resulted in WikiCSSH, a large-scale, hierarchically organized subject heading vocabulary for the domain of computer science (CS). Our evaluation suggests that WikiCSSH outperforms alternative CS vocabularies in terms of coverage of CS terms that occur in research articles. WikiCSSH can further distinguish between coarse-grained and fine-grained CS concepts. The outlined workflow can serve as a template for building hierarchically organized subject heading vocabularies for other domains that are covered in Wikipedia.

Original language: English (US)
Title of host publication: ADBIS, TPDL and EDA 2020 Common Workshops and Doctoral Consortium - International Workshops
Subtitle of host publication: DOING, MADEISD, SKG, BBIGAP, SIMPDA, AIMinScience 2020 and Doctoral Consortium, Proceedings
Editors: Ladjel Bellatreche, Mária Bieliková, Omar Boussaïd, Jérôme Darmont, Barbara Catania, Elena Demidova, Fabien Duchateau, Mark Hall, Tanja Mercun, Maja Žumer, Boris Novikov, Christos Papatheodorou, Thomas Risse, Oscar Romero, Lucile Sautot, Guilaine Talens, Robert Wrembel
Publisher: Springer
Pages: 207-218
Number of pages: 12
ISBN (Print): 9783030558130
State: Published - 2020
Event: 24th East-European Conference on Advances in Databases and Information Systems, ADBIS 2020, the 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, and the 16th Workshop on Business Intelligence and Big Data, EDA 2020 - Lyon, France
Duration: Aug 25 2020 – Aug 27 2020

Publication series

Name: Communications in Computer and Information Science
Volume: 1260 CCIS
ISSN (Print): 1865-0929
ISSN (Electronic): 1865-0937

Conference

Conference: 24th East-European Conference on Advances in Databases and Information Systems, ADBIS 2020, the 24th International Conference on Theory and Practice of Digital Libraries, TPDL 2020, and the 16th Workshop on Business Intelligence and Big Data, EDA 2020
Country: France
City: Lyon
Period: 8/25/20 – 8/27/20

Keywords

  • Computer science
  • Hierarchical vocabulary
  • Wikipedia

ASJC Scopus subject areas

  • Computer Science (all)
  • Mathematics (all)
