With global media content databases and online content being available, analyzing topical structures in different languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and proposes a new method–Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source-aligned word embeddings to understand the cross-lingual meanings of words and has a mechanism to normalize residual influence from language differences. We present a benchmark that compares the topics extracted from a corpus of English, German, and French news using our method with methods used in the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied in tracking news topics across time and languages.
ASJC Scopus subject areas