Language Identification for Austronesian Languages

Jonathan Dunn, Wikke Nijhof

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper provides language identification models for low- and under-resourced languages in the Pacific region with a focus on previously unavailable Austronesian languages. Accurate language identification is an important part of developing language resources. The approach taken in this paper combines 29 Austronesian languages with 171 non-Austronesian languages to create an evaluation set drawn from eight data sources. After evaluating six approaches to language identification, we find that a classifier based on skip-gram embeddings reaches significantly higher performance than the alternative methods. We then systematically increase the number of non-Austronesian languages in the model up to a total of 800 languages to evaluate whether an increased language inventory leads to less precise predictions for the Austronesian languages of interest. This evaluation finds that increasing the inventory of non-Austronesian languages has only a minimal impact on accuracy. Further experiments adapt these language identification models for code-switching detection, achieving high accuracy across all 29 languages.

Original language: English (US)
Title of host publication: Proceedings of the 13th Language Resources and Evaluation Conference
Editors: Nicoletta Calzolari, Frederic Bechet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Maegaard, Joseph Mariani, Helene Mazo, Jan Odijk, Stelios Piperidis
Publisher: European Language Resources Association (ELRA)
Pages: 6530-6539
Number of pages: 10
ISBN (Electronic): 9791095546726
State: Published - 2022
Externally published: Yes
Event: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022 - Marseille, France
Duration: Jun 20 2022 → Jun 25 2022

Conference

Conference: 13th International Conference on Language Resources and Evaluation Conference, LREC 2022
Country/Territory: France
City: Marseille
Period: 6/20/22 → 6/25/22

Keywords

  • Austronesian languages
  • code-switching detection
  • language identification
  • low-resource languages

ASJC Scopus subject areas

  • Language and Linguistics
  • Library and Information Sciences
  • Linguistics and Language
  • Education
