Mining the web for the induction of a dialectical Arabic lexicon

Rania Al-Sabbagh, Roxana Girju

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

This paper describes the first phase of building a lexicon of Egyptian Cairene Arabic (ECA), one of the most widely understood dialects in the Arab World, and Modern Standard Arabic (MSA). Each ECA entry is mapped to its MSA synonym, Part-of-Speech (POS) tag, and top-ranked contexts based on Web queries, so that each entry carries the basic syntactic and semantic information needed for a generic lexicon compatible with multiple NLP applications. Moreover, through their MSA synonyms, ECA entries gain access to the NLP tools and resources that are widely available for MSA. Using an associationist approach based on the correlations between word co-occurrence patterns in both varieties, we change the direction of the acquisition process from parallel to circular. This overcomes a bottleneck of current research on Arabic dialects, namely the lack of parallel corpora, and improves accuracy rates when using unrelated Web documents, which are more readily available. Manually evaluated on 1,000 word entries by two native speakers of both ECA and MSA, the proposed approach achieves a promising F-measure of 70.9%. In discussing the proposed algorithm, we highlight several semantic issues to be addressed in upcoming phases of inducing a more comprehensive ECA-MSA lexicon.
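As a rough illustration of what an association-based approach over co-occurrence patterns can look like, the sketch below ranks MSA candidate synonyms for one ECA word by comparing co-occurrence count vectors. It is not the authors' algorithm, and it does not model the circular acquisition process. The seed dictionary, the function names (`cosine`, `project`, `rank_candidates`), and all tokens and counts are hypothetical placeholders invented for this example.

```python
from collections import Counter
from math import sqrt


def cosine(u: Counter, v: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def project(eca_vector: Counter, seed: dict[str, str]) -> Counter:
    """Map ECA context words into MSA space via a seed dictionary.

    Context words missing from the seed dictionary are dropped.
    """
    projected = Counter()
    for word, count in eca_vector.items():
        if word in seed:
            projected[seed[word]] += count
    return projected


def rank_candidates(eca_vector, msa_vectors, seed):
    """Return (candidate, score) pairs, best match first."""
    projected = project(eca_vector, seed)
    scores = {w: cosine(projected, vec) for w, vec in msa_vectors.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Placeholder data (assumption): not drawn from the paper or any corpus.
seed = {"eca_ctx1": "msa_ctx1", "eca_ctx2": "msa_ctx2"}
eca_vector = Counter({"eca_ctx1": 4, "eca_ctx2": 1, "eca_ctx3": 2})
msa_vectors = {
    "msa_cand_a": Counter({"msa_ctx1": 8, "msa_ctx2": 2}),
    "msa_cand_b": Counter({"msa_ctx2": 5, "msa_ctx3": 5}),
}
ranking = rank_candidates(eca_vector, msa_vectors, seed)
```

Here `ranking` lists `msa_cand_a` first, because its context distribution is proportional to the projected ECA vector. In a real setting the count vectors would come from Web query results rather than hand-written dictionaries.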

Original language: English (US)
Title of host publication: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
Editors: Daniel Tapias, Irene Russo, Olivier Hamon, Stelios Piperidis, Nicoletta Calzolari, Khalid Choukri, Joseph Mariani, Helene Mazo, Bente Maegaard, Jan Odijk, Mike Rosner
Publisher: European Language Resources Association (ELRA)
Pages: 288-293
Number of pages: 6
ISBN (Electronic): 2951740867, 9782951740860
State: Published - Jan 1 2010
Event: 7th International Conference on Language Resources and Evaluation, LREC 2010 - Valletta, Malta
Duration: May 17 2010 to May 23 2010

Publication series

Name: Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010

Other

Other: 7th International Conference on Language Resources and Evaluation, LREC 2010
Country: Malta
City: Valletta
Period: 5/17/10 to 5/23/10

ASJC Scopus subject areas

  • Education
  • Library and Information Sciences
  • Linguistics and Language
  • Language and Linguistics

Cite this

Al-Sabbagh, R., & Girju, R. (2010). Mining the web for the induction of a dialectical Arabic lexicon. In D. Tapias, I. Russo, O. Hamon, S. Piperidis, N. Calzolari, K. Choukri, J. Mariani, H. Mazo, B. Maegaard, J. Odijk, & M. Rosner (Eds.), Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010 (pp. 288-293). (Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010). European Language Resources Association (ELRA).