TY - GEN
T1 - Text classification using label names only: A language model self-training approach
T2 - 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020
AU - Meng, Yu
AU - Zhang, Yunyi
AU - Huang, Jiaxin
AU - Xiong, Chenyan
AU - Ji, Heng
AU - Zhang, Chao
AU - Han, Jiawei
N1 - Publisher Copyright:
© 2020 Association for Computational Linguistics.
PY - 2020
Y1 - 2020
AB - Current text classification methods typically require a large number of human-labeled documents as training data, which can be costly and difficult to obtain in real applications. Humans, by contrast, can perform classification without seeing any labeled examples, relying only on a small set of words describing the categories to be classified. In this paper, we explore the potential of using only the label name of each class to train classification models on unlabeled data, without using any labeled documents. We use pre-trained neural language models both as general linguistic knowledge sources for category understanding and as representation learning models for document classification. Our method (1) associates semantically related words with the label names, (2) finds category-indicative words and trains the model to predict their implied categories, and (3) generalizes the model via self-training. We show that our model achieves around 90% accuracy on four benchmark datasets, including topic and sentiment classification, without using any labeled documents, learning from unlabeled data supervised by at most 3 words (1 in most cases) per class as the label name.
UR - http://www.scopus.com/inward/record.url?scp=85112348249&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85112348249&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85112348249
T3 - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
SP - 9006
EP - 9017
BT - EMNLP 2020 - 2020 Conference on Empirical Methods in Natural Language Processing, Proceedings of the Conference
PB - Association for Computational Linguistics (ACL)
Y2 - 16 November 2020 through 20 November 2020
ER -