TY - GEN
T1 - Minimally-supervised structure-rich text categorization via learning on text-rich networks
AU - Zhang, Xinyang
AU - Zhang, Chenwei
AU - Dong, Xin Luna
AU - Shang, Jingbo
AU - Han, Jiawei
N1 - Publisher Copyright:
© 2021 ACM.
PY - 2021/4/19
Y1 - 2021/4/19
N2 - Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting that aims to categorize documents effectively, with a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus' heterogeneous data sources and enables a joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases - a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and both modules mutually enhance each other by co-training using pooled pseudo labels. We test our model on two real-world datasets. On the challenging e-commerce product categorization dataset with 683 categories, our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%, significantly outperforming all compared methods; our accuracy is only less than 2% away from the supervised BERT model trained on about 50K labeled documents.
AB - Text categorization is an essential task in Web content analysis. Considering the ever-evolving Web data and new emerging categories, instead of the laborious supervised setting, in this paper, we focus on the minimally-supervised setting that aims to categorize documents effectively, with a couple of seed documents annotated per category. We recognize that texts collected from the Web are often structure-rich, i.e., accompanied by various metadata. One can easily organize the corpus into a text-rich network, joining raw text documents with document attributes, high-quality phrases, label surface names as nodes, and their associations as edges. Such a network provides a holistic view of the corpus' heterogeneous data sources and enables a joint optimization for network-based analysis and deep textual model training. We therefore propose a novel framework for minimally supervised categorization by learning from the text-rich network. Specifically, we jointly train two modules with different inductive biases - a text analysis module for text understanding and a network learning module for class-discriminative, scalable network learning. Each module generates pseudo training labels from the unlabeled document set, and both modules mutually enhance each other by co-training using pooled pseudo labels. We test our model on two real-world datasets. On the challenging e-commerce product categorization dataset with 683 categories, our experiments show that given only three seed documents per category, our framework can achieve an accuracy of about 92%, significantly outperforming all compared methods; our accuracy is only less than 2% away from the supervised BERT model trained on about 50K labeled documents.
KW - Minimal supervision
KW - Text categorization
KW - Text-rich networks
UR - http://www.scopus.com/inward/record.url?scp=85107987351&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85107987351&partnerID=8YFLogxK
U2 - 10.1145/3442381.3450114
DO - 10.1145/3442381.3450114
M3 - Conference contribution
AN - SCOPUS:85107987351
T3 - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
SP - 3258
EP - 3268
BT - The Web Conference 2021 - Proceedings of the World Wide Web Conference, WWW 2021
PB - Association for Computing Machinery
T2 - 2021 World Wide Web Conference, WWW 2021
Y2 - 19 April 2021 through 23 April 2021
ER -