TY - GEN
T1 - Mining the web for the induction of a dialectical Arabic lexicon
AU - Al-Sabbagh, Rania
AU - Girju, Roxana
N1 - Copyright 2017 Elsevier B.V., All rights reserved.
PY - 2010
Y1 - 2010
AB - This paper describes the first phase of building a lexicon of Egyptian Cairene Arabic (ECA), one of the most widely understood dialects in the Arab world, and Modern Standard Arabic (MSA). Each ECA entry is mapped to its MSA synonym, its Part-of-Speech (POS) tag, and its top-ranked contexts based on Web queries, so that each entry carries basic syntactic and semantic information for a generic lexicon compatible with multiple NLP applications. Moreover, through their MSA synonyms, ECA entries gain access to the considerable body of NLP tools and resources available for MSA. Using an associationist approach based on the correlations between word co-occurrence patterns in both dialects, we change the direction of the acquisition process from parallel to circular. This overcomes a bottleneck of current research on Arabic dialects, namely the lack of parallel corpora, and improves accuracy rates when using unrelated Web documents, which are more readily available. Manually evaluated on 1,000 word entries by two native speakers of the ECA and MSA varieties, the proposed approach achieves a promising F-measure of 70.9%. In discussing the proposed algorithm, we highlight several semantic issues to be addressed in upcoming phases of inducing a more comprehensive ECA-MSA lexicon.
UR - http://www.scopus.com/inward/record.url?scp=85037146021&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85037146021&partnerID=8YFLogxK
M3 - Conference contribution
AN - SCOPUS:85037146021
T3 - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
SP - 288
EP - 293
BT - Proceedings of the 7th International Conference on Language Resources and Evaluation, LREC 2010
A2 - Tapias, Daniel
A2 - Russo, Irene
A2 - Hamon, Olivier
A2 - Piperidis, Stelios
A2 - Calzolari, Nicoletta
A2 - Choukri, Khalid
A2 - Mariani, Joseph
A2 - Mazo, Helene
A2 - Maegaard, Bente
A2 - Odijk, Jan
A2 - Rosner, Mike
PB - European Language Resources Association (ELRA)
T2 - 7th International Conference on Language Resources and Evaluation, LREC 2010
Y2 - 17 May 2010 through 23 May 2010
ER -