Arabic NLP: A Survey of Pre-Processing and Representation Techniques

Hussein Ala'a Alkaabi, Ali Kadhim Jasim, Ali Darroudi

Abstract


The rapid growth of Arabic Natural Language Processing (NLP) has underscored the vital role of upstream tasks that prepare raw text for modeling. This review systematically examines the key steps in Arabic text pre-processing and representation learning, highlighting their impact on downstream NLP performance. We discuss the linguistic challenges specific to Arabic, such as rich morphology, orthographic ambiguity, dialectal diversity, and code-switching. The survey covers traditional rule-based and statistical methods as well as modern deep learning approaches, including subword tokenization and contextual embeddings. Special attention is given to how pre-trained language models such as AraBERT and MARBERT interact with pre-processing pipelines, often shifting the balance between explicit text normalization and implicit representation learning. Furthermore, we analyze existing tools, benchmarks, and evaluation metrics, and identify persistent gaps such as dialect adaptation and Romanized Arabic (Arabizi) processing. By mapping current practices and open issues, this review aims to guide researchers and practitioners toward more robust, adaptive, and linguistically aware Arabic NLP pipelines, so that the data fed into models is as clean, consistent, and semantically meaningful as possible.


Keywords


Arabic NLP, Pre-processing, Morphological Analysis, Dialectal Arabic, Deep Learning.


DOI: https://doi.org/10.30596/jcositte.v6i2.25562
