TY - GEN
T1 - On the Power of Pre-Trained Text Representations
T2 - 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2021
AU - Meng, Yu
AU - Huang, Jiaxin
AU - Zhang, Yu
AU - Han, Jiawei
N1 - Funding Information:
Research was supported in part by US DARPA KAIROS Program No. FA8750-19-2-1004 and SocialSim Program No. W911NF-17-C-0099, National Science Foundation IIS-19-56151, IIS-17-41317, and IIS-17-04532, and the Molecule Maker Lab Institute: An AI Research Institutes program supported by NSF under Award No. 2019897. Any opinions, findings, and conclusions or recommendations expressed herein are those of the authors and do not necessarily represent the views, either expressed or implied, of DARPA or the U.S. Government.
Tutorial Outline:
– Overview of the Applications of Pre-Trained Text Representations in Text Mining
• Text Embedding and Language Models
  – Euclidean Context-Free Embeddings [3, 17, 20]
  – Non-Euclidean Context-Free Embeddings [12, 19, 24]
  – Contextualized Language Models [5, 6, 10, 21, 27]
  – Weakly-Supervised Embeddings [11, 16]
• Topic Discovery with Embeddings
  – Traditional Topic Models [1, 2, 18]
  – Topic Discovery via Clustering Pre-Trained Embeddings [23]
  – Embedding-Based Discriminative Topic Mining [11, 16]
• Weakly-Supervised Text Classification
  – Flat Text Classification [4, 13, 15, 25]
  – Text Classification with Taxonomy Information [14, 22]
  – Text Classification with Metadata Information [29, 30]
• Other Text Mining Applications Empowered by Pre-Trained Language Models
  – Phrase/Entity Mining [7]
  – Named Entity Recognition [26]
  – Taxonomy Construction [9]
  – Aspect-Based Sentiment Analysis [8]
  – Text Summarization [28]
• Summary and Future Directions
Publisher Copyright:
© 2021 Owner/Author.
PY - 2021/8/14
Y1 - 2021/8/14
N2 - Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications as they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the foundation for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective, and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analyses.
AB - Recent years have witnessed the enormous success of text representation learning in a wide range of text mining tasks. Earlier word embedding learning approaches represent words as fixed low-dimensional vectors to capture their semantics. The word embeddings so learned are used as the input features of task-specific models. Recently, pre-trained language models (PLMs), which learn universal language representations via pre-training Transformer-based neural models on large-scale text corpora, have revolutionized the natural language processing (NLP) field. Such pre-trained representations encode generic linguistic features that can be transferred to almost any text-related application. PLMs outperform previous task-specific models in many applications as they only need to be fine-tuned on the target corpus instead of being trained from scratch. In this tutorial, we introduce recent advances in pre-trained text embeddings and language models, as well as their applications to a wide range of text mining tasks. Specifically, we first overview a set of recently developed self-supervised and weakly-supervised text embedding methods and pre-trained language models that serve as the foundation for downstream tasks. We then present several new methods based on pre-trained text embeddings and language models for various text mining applications such as topic discovery and text classification. We focus on methods that are weakly-supervised, domain-independent, language-agnostic, effective, and scalable for mining and discovering structured knowledge from large-scale text corpora. Finally, we demonstrate with real-world datasets how pre-trained text representations help mitigate the human annotation burden and facilitate automatic, accurate, and efficient text analyses.
KW - language models
KW - text embedding
KW - text mining
KW - topic discovery
UR - http://www.scopus.com/inward/record.url?scp=85114916346&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85114916346&partnerID=8YFLogxK
U2 - 10.1145/3447548.3470810
DO - 10.1145/3447548.3470810
M3 - Conference contribution
AN - SCOPUS:85114916346
T3 - Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
SP - 4052
EP - 4053
BT - KDD 2021 - Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PB - Association for Computing Machinery
Y2 - 14 August 2021 through 18 August 2021
ER -