Abstract
This paper evaluates global-scale dialect identification for 14 national varieties of English on both web-crawled data and Twitter data. The paper makes three main contributions: (i) introducing data-driven language mapping as a method for selecting the inventory of national varieties to include in the task; (ii) producing a large and dynamic set of syntactic features using grammar induction rather than focusing on a few hand-selected features such as function words; and (iii) comparing models across both web corpora and social media corpora in order to measure the robustness of syntactic variation across registers.
| Original language | English (US) |
|---|---|
| Title of host publication | Proceedings of the Sixth Workshop on NLP for Similar Languages, Varieties and Dialects |
| Editors | Marcos Zampieri, Preslav Nakov, Shervin Malmasi, Nikola Ljubešić, Jörg Tiedemann, Ahmed Ali |
| Publisher | Association for Computational Linguistics |
| Pages | 42-53 |
| ISBN (Electronic) | 9781950737116 |
| State | Published - Jun 2019 |
| Externally published | Yes |