ZAEBUC زئـــــــــبق

Zayed Arabic-English Bilingual Undergraduate Corpus 

Mission: an open bilingual user corpus for research

The Zayed Arabic-English Bilingual Undergraduate Corpus (ZAEBUC) is a new kind of corpus: it focuses on a large set of bilingual writers and comprises samples of their writing in both of their languages. It is estimated that more than half the world’s population uses more than one language every day (Grosjean, 2010), and many of these people are literate to some level in more than one language. However, corpus-based research has tended to focus on one language and one community of writers at a time. For example, research on ‘learner corpora’ of writing in English compares this writing with a corpus of writing by other, ‘native’ users of the same language.

In contrast to a ‘parallel corpus’, which pairs texts in one language with translations of those same texts into another language, ZAEBUC is a bilingual writer corpus, matching comparable texts in different languages written by the same writer on different occasions. It currently comprises short essays written by several hundred (mainly Emirati) Freshman students; in total, the corpus currently consists of 388 English essays (~88,000 words) and 214 Arabic essays (~33,000 words).

The corpus is provided in uncorrected and corrected versions, so that errors in spelling and basic sentence grammar can be identified and analyzed. Both Arabic and English texts are also rated by three assessors using the Common European Framework of Reference (CEFR; Council of Europe, 2001). Additionally, the corpus is automatically and manually annotated for part of speech, lemmas and other features. We followed commonly used standards for tokenization, tagging and lemmatization for Arabic and English to allow the use of the corpus in computational research (Marcus et al., 1993; Maamouri et al., 2004). In particular, we used the Universal Dependencies part-of-speech standards, as they are designed to maximize comparability between languages (Nivre et al., 2016). Finally, metadata about each writer/text enables researchers to compare subcorpora. The corpus will be made available in a number of formats to accommodate different research communities’ needs, from basic TSV and TXT files to interfaces supported by SketchEngine (Kilgarriff et al., 2014).
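
As an illustration of how a TSV release with Universal Dependencies part-of-speech tags might be used, the following Python sketch counts tags in a token-per-line file. The column names (doc_id, token, lemma, upos) and the sample rows are assumptions made for this example only; the actual file layout of the corpus may differ and should be checked against the downloaded files.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in a token-per-line layout. The column names and
# rows are invented for illustration; they are not taken from the corpus.
SAMPLE_TSV = (
    "doc_id\ttoken\tlemma\tupos\n"
    "EN-001\tStudents\tstudent\tNOUN\n"
    "EN-001\twrite\twrite\tVERB\n"
    "EN-001\tessays\tessay\tNOUN\n"
    "EN-001\t.\t.\tPUNCT\n"
)


def upos_counts(tsv_file):
    """Count part-of-speech tags in a tab-separated, token-per-line file."""
    # QUOTE_NONE keeps quotation-mark tokens from being read as CSV quoting.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter(row["upos"] for row in reader)


counts = upos_counts(io.StringIO(SAMPLE_TSV))
print(counts.most_common())
```

With a real file, the same function could be called on an open file handle, for example `open(path, encoding="utf-8", newline="")`, once the column names have been adjusted to match the release.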

ZAEBUC will be an open research resource, aligned with the recent ‘multilingual turn’ in applied linguistics. It will enable researchers to answer a range of questions such as “Do students who use more complex constructions in their Arabic writing also tend to use more complex constructions in their English writing?”; “Do male students use different vocabulary from female students when writing in Arabic? And is a similar pattern evident when writing in English?”; or “Is Arabic clearly dominant compared to English in students who studied at an Arabic-medium high school? And is the inverse pattern evident for students who studied at an English-medium high school?”.
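
A question such as the first one above could be approached by pairing each writer's Arabic and English essays and correlating a complexity measure across the two languages. The Python sketch below shows one possible way to do this. The record fields (writer, lang, complexity), the scores, and the choice of Pearson correlation are all assumptions for illustration; none of them come from the corpus or its documentation.

```python
from math import sqrt

# Made-up illustrative records, not corpus data. "complexity" stands in
# for any per-essay measure, such as mean sentence length.
essays = [
    {"writer": "w1", "lang": "ar", "complexity": 12.0},
    {"writer": "w1", "lang": "en", "complexity": 10.0},
    {"writer": "w2", "lang": "ar", "complexity": 18.0},
    {"writer": "w2", "lang": "en", "complexity": 13.0},
    {"writer": "w3", "lang": "ar", "complexity": 24.0},
    {"writer": "w3", "lang": "en", "complexity": 16.0},
    {"writer": "w4", "lang": "en", "complexity": 9.0},  # no Arabic essay
]


def paired_scores(records):
    """Return (arabic, english) score pairs for writers who have both."""
    by_writer = {}
    for rec in records:
        by_writer.setdefault(rec["writer"], {})[rec["lang"]] = rec["complexity"]
    return [(s["ar"], s["en"]) for s in by_writer.values()
            if "ar" in s and "en" in s]


def pearson(pairs):
    """Pearson correlation coefficient for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    var_x = sum((x - mean_x) ** 2 for x, _ in pairs)
    var_y = sum((y - mean_y) ** 2 for _, y in pairs)
    return cov / sqrt(var_x * var_y)


pairs = paired_scores(essays)
r = pearson(pairs)
print(len(pairs), round(r, 3))
```

Writers with an essay in only one language are dropped before the correlation is computed, which matters here because the corpus contains more English essays than Arabic ones.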


Downloads

To download the corpus, click here.


Research Team

David M. Palfreyman (UAE University)

Nizar Habash (New York University Abu Dhabi, CAMeL Lab)


Acknowledgements

The creators of this corpus acknowledge the support of this project from the Zayed University Research Incentive Fund (award R19068).

We also extend thanks to Ramy Eskander for helpful discussions and the team of annotators at Ramitechs for their help in creating this resource.


Publications

ZAEBUC: An Annotated Arabic-English Bilingual Writer Corpus. Nizar Habash & David Palfreyman. In Proceedings of the 13th Conference on Language Resources and Evaluation (LREC 2022), pp. 79-88, Marseille, 2022.

Bilingual Writers and Corpus Analysis. David M. Palfreyman & Nizar Habash (Eds., 2022). Routledge.


References

Council of Europe (2001). Common European Framework of Reference for Languages: learning, teaching, assessment. Cambridge University Press.

Grosjean, F. (2010). Bilingual. Harvard University Press.

Kilgarriff, A., Baisa, V., Bušta, J., Jakubíček, M., Kovář, V., Michelfeit, J., ... & Suchomel, V. (2014). The Sketch Engine: ten years on. Lexicography, 1(1), 7-36.

Maamouri, M., Bies, A., Buckwalter, T., & Mekki, W. (2004, September). The Penn Arabic Treebank: Building a large-scale annotated Arabic corpus. In NEMLAR conference on Arabic language resources and tools (Vol. 27, pp. 466-467).

Marcus, M., Santorini, B., & Marcinkiewicz, M. A. (1993). Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics, 19(2), 313-330.

Nivre, J., de Marneffe, M.-C., Ginter, F., Goldberg, Y., Hajič, J., Manning, C. D., McDonald, R., et al. (2016). Universal Dependencies v1: A multilingual treebank collection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC'16) (pp. 1659-1666).
