YADAC: Yet another dialectal Arabic corpus

Rania Al-Sabbagh, Roxana Girju

Research output: Chapter in Book/Report/Conference proceeding > Conference contribution

Abstract

This paper presents the first phase of building YADAC, a multi-genre Dialectal Arabic (DA) corpus compiled from Web data: microblogs (i.e., Twitter), blogs and forums, and online knowledge market services in which both questions and answers are user-generated. In addition to introducing two new genres to current efforts to build DA corpora (microblogs and question-answer pairs extracted from online knowledge market services), the paper highlights and tackles several issues in building DA corpora that previous studies have not handled: function-based Web harvesting and dialect identification, vowel-based spelling variation, linguistic hypercorrection and its effect on spelling variation, and unsupervised Part-of-Speech (POS) tagging and base-phrase chunking for DA. Although the algorithms for both POS tagging and base-phrase chunking are still under development, the results so far are promising.

Original language: English (US)
Title of host publication: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012
Editors: Mehmet Ugur Dogan, Joseph Mariani, Asuncion Moreno, Sara Goggi, Khalid Choukri, Nicoletta Calzolari, Jan Odijk, Thierry Declerck, Bente Maegaard, Stelios Piperidis, Helene Mazo, Olivier Hamon
Publisher: European Language Resources Association (ELRA)
Pages: 2882-2889
Number of pages: 8
ISBN (Electronic): 9782951740877
State: Published - Jan 1 2012
Event: 8th International Conference on Language Resources and Evaluation, LREC 2012 - Istanbul, Turkey
Duration: May 21 2012 to May 27 2012

Publication series

Name: Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012

Other

Other: 8th International Conference on Language Resources and Evaluation, LREC 2012
Country: Turkey
City: Istanbul
Period: 5/21/12 to 5/27/12

Keywords

  • Dialect Identification
  • Dialectal Arabic
  • POS tagging

ASJC Scopus subject areas

  • Linguistics and Language
  • Language and Linguistics
  • Education
  • Library and Information Sciences

Cite this

Al-Sabbagh, R., & Girju, R. (2012). YADAC: Yet another dialectal Arabic corpus. In M. U. Dogan, J. Mariani, A. Moreno, S. Goggi, K. Choukri, N. Calzolari, J. Odijk, T. Declerck, B. Maegaard, S. Piperidis, H. Mazo, & O. Hamon (Eds.), Proceedings of the 8th International Conference on Language Resources and Evaluation, LREC 2012 (pp. 2882-2889). European Language Resources Association (ELRA).