This paper describes our multiclass classification system developed as part of the LT-EDI@RANLP-2023 shared task. We used a BERT-based language model to detect homophobic and transphobic content in social media comments across five language conditions: English, Spanish, Hindi, Malayalam, and Tamil. We retrained a transformer-based cross-lingual pretrained language model, XLM-RoBERTa, with spatially and temporally relevant social media language data. We found that the inclusion of this spatio-temporal data improved classification performance for all language and task conditions when compared with the baseline. We also retrained a subset of models with simulated script-mixed social media language data, with varied performance. The results from the current study suggest that transformer-based language classification systems are sensitive to register-specific and language-specific retraining.
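The retraining step described above amounts to fine-tuning XLM-RoBERTa for multiclass sequence classification. The following is a minimal sketch assuming the Hugging Face Transformers API; the label count, hyperparameters, and dataset variables (train_ds, eval_ds) are illustrative placeholders, not the authors' released configuration.

# Minimal fine-tuning sketch for XLM-RoBERTa multiclass classification.
# Assumptions: 3 labels (e.g. none / homophobic / transphobic) and
# typical fine-tuning hyperparameters; neither is confirmed by the paper.
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=3
)

def tokenize(batch):
    # Social media comments are short; 128 tokens is an assumed cap.
    return tokenizer(batch["text"], truncation=True, max_length=128)

args = TrainingArguments(
    output_dir="xlmr-ltedi",
    learning_rate=2e-5,              # assumed, typical for fine-tuning
    num_train_epochs=3,              # assumed
    per_device_train_batch_size=16,  # assumed
)

# train_ds / eval_ds stand in for the shared-task comment datasets
# (hypothetical variables, not released artifacts):
# train_ds = train_ds.map(tokenize, batched=True)
# eval_ds = eval_ds.map(tokenize, batched=True)
# trainer = Trainer(model=model, args=args,
#                   train_dataset=train_ds, eval_dataset=eval_ds,
#                   tokenizer=tokenizer)
# trainer.train()

The spatio-temporal and script-mixed retraining conditions would slot into this pipeline as different choices of pretraining or fine-tuning corpus rather than changes to the model code itself.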
Title of host publication: Proceedings of the Third Workshop on Language Technology for Equality, Diversity and Inclusion
Editors: Bharathi R. Chakravarthi, B. Bharathi, Josephine Griffith, Kalika Bali, Paul Buitelaar
Publisher: INCOMA Ltd., Shoumen, Bulgaria
Published: September 1, 2023