TY - JOUR
T1 - Reproducible Extraction of Cross-lingual Topics (rectr)
AU - Chan, Chung Hong
AU - Zeng, Jing
AU - Wessler, Hartmut
AU - Jungblut, Marc
AU - Welbers, Kasper
AU - Bajjalieh, Joseph W.
AU - van Atteveldt, Wouter
AU - Althaus, Scott L.
N1 - Publisher Copyright: © 2020 Taylor & Francis Group, LLC.
PY - 2020
Y1 - 2020
N2 - With global media content databases and online content being available, analyzing topical structures in different languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source aligned word embeddings to understand the cross-lingual meanings of words and has a mechanism to normalize residual influence from language differences. We present a benchmark that compares the topics extracted from a corpus of English, German, and French news using our method with methods used in the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied in tracking news topics across time and languages.
AB - With global media content databases and online content being available, analyzing topical structures in different languages simultaneously has become an urgent computational task. Some previous studies have analyzed topics in a multilingual corpus by translating all items into a single language using a machine translation service, such as Google Translate. We argue that this method is not reproducible in the long run and propose a new method: Reproducible Extraction of Cross-lingual Topics Using R (rectr). Our method utilizes open-source aligned word embeddings to understand the cross-lingual meanings of words and has a mechanism to normalize residual influence from language differences. We present a benchmark that compares the topics extracted from a corpus of English, German, and French news using our method with methods used in the literature. We show that our method is not only reproducible but can also generate high-quality cross-lingual topics. We demonstrate how our method can be applied in tracking news topics across time and languages.
UR - http://www.scopus.com/inward/record.url?scp=85090468046&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85090468046&partnerID=8YFLogxK
U2 - 10.1080/19312458.2020.1812555
DO - 10.1080/19312458.2020.1812555
M3 - Article
AN - SCOPUS:85090468046
SN - 1931-2458
VL - 14
SP - 285
EP - 305
JO - Communication Methods and Measures
JF - Communication Methods and Measures
IS - 4
ER -