Skip to main navigation Skip to search Skip to main content

TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter

  • Xinyang Zhang
  • , Yury Malkov
  • , Omar Florez
  • , Serim Park
  • , Brian McWilliams
  • , Jiawei Han
  • , Ahmed El-Kishky

Research output: Chapter in Book/Report/Conference proceedingConference contribution

Abstract

Pre-trained language models (PLMs) are fundamental for natural language processing applications. Most existing PLMs are not tailored to the noisy user-generated text on social media, and the pre-training does not factor in the valuable social engagement logs available in a social network. We present TwHIN-BERT, a multilingual language model productionized at Twitter, trained on in-domain data from the popular social network. TwHIN-BERT differs from prior pre-trained language models as it is trained with not only text-based self-supervision but also with a social objective based on the rich social engagements within a Twitter heterogeneous information network (TwHIN). Our model is trained on 7 billion tweets covering over 100 distinct languages, providing a valuable representation to model short, noisy, user-generated text. We evaluate our model on various multilingual social recommendation and semantic understanding tasks and demonstrate significant metric improvement over established pre-trained language models. We open-source TwHIN-BERT and our curated hashtag prediction and social engagement benchmark datasets to the research community.

Original languageEnglish (US)
Title of host publicationKDD 2023 - Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
PublisherAssociation for Computing Machinery
Pages5597-5607
Number of pages11
ISBN (Electronic)9798400701030
DOIs
StatePublished - Aug 6 2023
Event29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023 - Long Beach, United States
Duration: Aug 6 2023Aug 10 2023

Publication series

NameProceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining
ISSN (Print)2154-817X

Conference

Conference29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, KDD 2023
Country/TerritoryUnited States
CityLong Beach
Period8/6/238/10/23

Keywords

  • language models
  • social engagement
  • social media

ASJC Scopus subject areas

  • Software
  • Information Systems

Fingerprint

Dive into the research topics of 'TwHIN-BERT: A Socially-Enriched Pre-trained Language Model for Multilingual Tweet Representations at Twitter'. Together they form a unique fingerprint.

Cite this