TY - JOUR
T1 - Reproducible Extraction of Cross-lingual Topics (rectr)
AU - Chan, Chung Hong
AU - Zeng, Jing
AU - Wessler, Hartmut
AU - Jungblut, Marc
AU - Welbers, Kasper
AU - Bajjalieh, Joseph W.
AU - van Atteveldt, Wouter
AU - Althaus, Scott L.
N1 - Funding Information:
This project was funded by a research grant from 1) the German Research Foundation (Deutsche Forschungsgemeinschaft), 2) the Netherlands Organisation for Scientific Research (Nederlandse Organisatie voor Wetenschappelijk Onderzoek), and 3) the National Endowment for the Humanities, through the Trans-Atlantic Platform’s Digging into Data Challenge funding program. The authors would like to thank Fabienne Lind (Computational Communication Science Lab, University of Vienna) for her comments, which greatly improved this manuscript; Valeria Glauser (University of Zurich) for her help in trilingual coding
Publisher Copyright:
© 2020 Taylor & Francis Group, LLC.
PY - 2020
Y1 - 2020
N2 - With global media content databases and online content being available, analyzing topical structures in different languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source aligned word embeddings to understand the cross-lingual meanings of words and has a mechanism to normalize residual influence from language differences. We present a benchmark that compares the topics extracted from a corpus of English, German, and French news using our method with methods used in the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied in tracking news topics across time and languages.
AB - With global media content databases and online content being available, analyzing topical structures in different languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source aligned word embeddings to understand the cross-lingual meanings of words and has a mechanism to normalize residual influence from language differences. We present a benchmark that compares the topics extracted from a corpus of English, German, and French news using our method with methods used in the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied in tracking news topics across time and languages.
UR - http://www.scopus.com/inward/record.url?scp=85090468046&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090468046&partnerID=8YFLogxK
U2 - 10.1080/19312458.2020.1812555
DO - 10.1080/19312458.2020.1812555
M3 - Article
AN - SCOPUS:85090468046
SN - 1931-2458
VL - 14
SP - 285
EP - 305
JO - Communication Methods and Measures
JF - Communication Methods and Measures
IS - 4
ER -