Abstract
This paper provides language identification models for low- and under-resourced languages in the Pacific region, with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings reaches significantly higher performance than alternative methods. We then systematically increase the number of non-Austronesian languages in the model up to a total of 800 languages to evaluate whether an increased language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that increasing the inventory of non-Austronesian languages has only a minimal impact on accuracy. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.
Original language | English (US) |
---|---|
Title of host publication | Proceedings of the 13th Language Resources and Evaluation Conference |
Editors | Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis |
Publisher | European Language Resources Association (ELRA) |
Pages | 6530-6539 |
Number of pages | 10 |
ISBN (Electronic) | 9791095546726 |
State | Published - 2022 |
Externally published | Yes |
Event | 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France; Duration: Jun 20 2022 → Jun 25 2022 |
Conference
Conference | 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 |
---|---|
Country/Territory | France |
City | Marseille |
Period | 6/20/22 → 6/25/22 |
Keywords
- Austronesian languages
- code-switching detection
- language identification
- low-resource languages
ASJC Scopus subject areas
- Language and Linguistics
- Library and Information Sciences
- Linguistics and Language
- Education