Word embedding enrichment for dictionary construction: An example of incivility in Cantonese

Hai Liang, Yee Man Margaret Ng, Nathan L.T. Tsang

Research output: Contribution to journal › Article › peer-review

Abstract

Dictionary-based methods remain valuable to measure concepts based on texts, though supervised machine learning has been widely used in much recent communication research. The present study proposes a semi-automatic and easily implemented method to build and enrich dictionaries based on word embeddings. As an example, we create a dictionary of political incivility that contains vulgarity and name-calling words in Cantonese. The study shows that dictionary-based classification outperforms supervised machine learning methods, including deep neural network models. Furthermore, a small number of random seed words can generate a highly accurate dictionary. However, the uncivil content detected is only weakly correlated with uncivil perceptions, as we demonstrate in a population-based survey experiment. The strengths and limitations of dictionary-based methods are discussed.
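
The abstract does not spell out the enrichment procedure, so the following is only a minimal sketch of one common way to enrich a dictionary from seed words: add every vocabulary word whose cosine similarity to some seed exceeds a threshold. The function names (`enrich`, `classify`), the threshold value, the placeholder tokens, and the toy three-dimensional vectors are all assumptions made for illustration and are not taken from the study. A real workflow would use embeddings trained on a Cantonese corpus and, in keeping with a semi-automatic approach, would presumably have a human review the candidates before accepting them.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def enrich(seeds, embeddings, threshold=0.8):
    """Return the seeds plus every word close to at least one seed.

    `embeddings` maps word -> vector. Seeds missing from the
    vocabulary are ignored. The threshold is a hypothetical default.
    """
    known_seeds = [s for s in seeds if s in embeddings]
    dictionary = set(known_seeds)
    for word, vec in embeddings.items():
        if word in dictionary:
            continue
        if any(cosine(vec, embeddings[s]) >= threshold for s in known_seeds):
            dictionary.add(word)
    return dictionary


def classify(tokens, dictionary):
    """Flag a tokenized text as uncivil if any token is in the dictionary."""
    return any(t in dictionary for t in tokens)


# Toy embeddings with placeholder tokens (illustrative only).
toy_embeddings = {
    "insult_a": [0.9, 0.1, 0.0],
    "insult_b": [0.85, 0.2, 0.05],
    "insult_c": [0.8, 0.15, 0.1],
    "policy": [0.1, 0.9, 0.2],
    "budget": [0.05, 0.85, 0.3],
}

enriched = enrich(["insult_a"], toy_embeddings, threshold=0.95)
print(sorted(enriched))
print(classify(["policy", "insult_c"], enriched))
```

In this toy setup a single seed pulls in its two near neighbours while the unrelated words stay out. The threshold controls the trade-off between coverage and false positives and would need tuning against labeled data.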

Original language: English (US)
Journal: Computational Communication Research
Volume: 5
Issue number: 1
State: Published - 2023

Keywords

  • Cantonese
  • dictionary construction
  • machine learning
  • political incivility
  • swearing

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Linguistics and Language
