TY - GEN
T1 - HiGitClass
T2 - 19th IEEE International Conference on Data Mining, ICDM 2019
AU - Zhang, Yu
AU - Xu, Frank F.
AU - Li, Sha
AU - Meng, Yu
AU - Wang, Xuan
AU - Li, Qi
AU - Han, Jiawei
N1 - Funding Information:
We thank Xiaotao Gu for useful discussions. The research was sponsored in part by U.S. Army Research Lab. under Cooperative Agreement No. W911NF-09-2-0053 (NSCTA), DARPA under Agreement No. W911NF-17-C0099, National Science Foundation IIS 16-18481, IIS 17-04532, and IIS-17-41317, DTRA HDTRA11810026, and grant 1U54GM114838 awarded by NIGMS through funds provided by the trans-NIH Big Data to Knowledge (BD2K) initiative (www.bd2k.nih.gov). The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies. We thank anonymous reviewers for valuable and insightful feedback.
Publisher Copyright:
© 2019 IEEE.
PY - 2019/11
Y1 - 2019/11
N2 - GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic repository classification problem as keyword-driven hierarchical classification. Specifically, users only need to provide a label hierarchy with keywords to supply as supervision. This setting is flexible, adaptive to the users' needs, accounts for the different granularity of topic labels and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In recognition of these challenges, we propose the HiGitClass framework, comprising of three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that HiGitClass is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification. Code and datasets related to this paper are available at https://github.com/yuzhimanhua/HiGitClass.
AB - GitHub has become an important platform for code sharing and scientific exchange. With the massive number of repositories available, there is a pressing need for topic-based search. Even though the topic label functionality has been introduced, the majority of GitHub repositories do not have any labels, impeding the utility of search and topic-based analysis. This work targets the automatic repository classification problem as keyword-driven hierarchical classification. Specifically, users only need to provide a label hierarchy with keywords to supply as supervision. This setting is flexible, adaptive to the users' needs, accounts for the different granularity of topic labels and requires minimal human effort. We identify three key challenges of this problem, namely (1) the presence of multi-modal signals; (2) supervision scarcity and bias; (3) supervision format mismatch. In recognition of these challenges, we propose the HiGitClass framework, comprising of three modules: heterogeneous information network embedding; keyword enrichment; topic modeling and pseudo document generation. Experimental results on two GitHub repository collections confirm that HiGitClass is superior to existing weakly-supervised and dataless hierarchical classification methods, especially in its ability to integrate both structured and unstructured data for repository classification. Code and datasets related to this paper are available at https://github.com/yuzhimanhua/HiGitClass.
KW - GitHub
KW - Hierarchical classification
KW - Weakly supervised learning
UR - http://www.scopus.com/inward/record.url?scp=85078935012&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85078935012&partnerID=8YFLogxK
U2 - 10.1109/ICDM.2019.00098
DO - 10.1109/ICDM.2019.00098
M3 - Conference contribution
AN - SCOPUS:85078935012
T3 - Proceedings - IEEE International Conference on Data Mining, ICDM
SP - 876
EP - 885
BT - Proceedings - 19th IEEE International Conference on Data Mining, ICDM 2019
A2 - Wang, Jianyong
A2 - Shim, Kyuseok
A2 - Wu, Xindong
PB - Institute of Electrical and Electronics Engineers Inc.
Y2 - 8 November 2019 through 11 November 2019
ER -