HiGitClass: Keyword-driven hierarchical classification of GitHub repositories

Yu Zhang, Frank F. Xu, Sha Li, Yu Meng, Xuan Wang, Qi Li, Jiawei Han

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work formulates automatic repository classification as keyword-driven hierarchical classification. Specifically, users only need to provide a label hierarchy with keywords as supervision. This setting is flexible, adapts to users' needs, accounts for the different granularity of topic labels, and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In recognition of these challenges, we propose the HiGitClass framework, comprising three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that HiGitClass is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification. Code and datasets related to this paper are available at https://github.com/yuzhimanhua/HiGitClass.
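To make the supervision format concrete, the following minimal sketch shows what "a label hierarchy with keywords" might look like, paired with a naive keyword-counting baseline. It is not the HiGitClass method: it has no network embedding, keyword enrichment, or pseudo document generation. The category names, keywords, and the `classify` helper are all hypothetical and chosen only for illustration.

```python
import re
from collections import Counter

# Hypothetical two-level label hierarchy (parent -> child -> keywords).
# These labels and keywords are illustrative assumptions, not the paper's data.
HIERARCHY = {
    "machine-learning": {
        "computer-vision": ["image", "detection", "segmentation"],
        "nlp": ["text", "language", "translation"],
    },
    "bioinformatics": {
        "sequence-analysis": ["genome", "sequence", "alignment"],
        "molecular-modeling": ["protein", "molecule", "docking"],
    },
}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def classify(readme):
    """Return the (parent, child) path whose keywords occur most often.

    Naive baseline: counts exact keyword matches only and returns None
    when no keyword appears in the text.
    """
    counts = Counter(tokenize(readme))
    best_path, best_score = None, 0
    for parent, children in HIERARCHY.items():
        for child, keywords in children.items():
            score = sum(counts[k] for k in keywords)
            if score > best_score:
                best_path, best_score = (parent, child), score
    return best_path


print(classify("A toolkit for protein docking and molecule screening"))
```

A baseline like this uses only the text signal and fails whenever a repository description does not contain the exact keywords, which illustrates the supervision scarcity challenge the abstract names.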

Original language: English (US)
Title of host publication: Proceedings - 19th IEEE International Conference on Data Mining, ICDM 2019
Editors: Jianyong Wang, Kyuseok Shim, Xindong Wu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 876-885
Number of pages: 10
ISBN (Electronic): 9781728146034
State: Published - Nov 2019
Event: 19th IEEE International Conference on Data Mining, ICDM 2019 - Beijing, China
Duration: Nov 8 2019 to Nov 11 2019

Publication series

Name: Proceedings - IEEE International Conference on Data Mining, ICDM
Volume: 2019-November
ISSN (Print): 1550-4786

Conference

Conference: 19th IEEE International Conference on Data Mining, ICDM 2019
Country: China
City: Beijing
Period: 11/8/19 to 11/11/19

Keywords

  • GitHub
  • Hierarchical classification
  • Weakly supervised learning

ASJC Scopus subject areas

  • Engineering (all)
