Arabic text categorization, Deep learning, Word embedding, Arabic corpus


Computer Sciences | Data Science | Science and Technology Studies


In recent years, Natural Language Processing (NLP) for the Arabic language has gained importance because of the massive amount of unstructured textual information available online and because NLP makes information retrieval easier. One widely used NLP task is text classification, whose goal is to apply machine learning techniques to automatically assign text documents to one or more predefined categories. An important step in machine learning is finding suitable, sufficiently large data for training and testing an algorithm. Moreover, Deep Learning (DL), a currently prominent area of machine learning research, requires large amounts of data and needs to be trained on several different and challenging datasets to perform at its best. Currently, few corpora are available for Arabic text categorization research. These corpora are small, and some of them are unbalanced or contain redundant data. In this paper, a new large Arabic corpus is proposed. The corpus was collected from 16 Arabic online news portals using an automated web crawling process. Two versions are available. The first is imbalanced and contains 3,252,934 articles distributed across 8 predefined categories; it can be used to generate Arabic word embeddings. The second is balanced and contains 720,000 articles, also distributed across 8 predefined categories, with 90,000 articles in each; it can be used in Arabic text classification research. The corpus can be made available for research purposes upon request. Two experiments were conducted to show the impact of dataset size and of pre-trained word2vec word embeddings on the performance of Arabic text classification using a deep learning model.
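
As an illustration of how a balanced version can be derived from an imbalanced one, the following is a minimal sketch that draws the same number of articles from each category by random downsampling. The abstract does not state how the balanced version was actually built, so the sampling strategy, the function name `balance_by_category`, and the `(label, text)` input layout are all assumptions made for this example.

```python
import random
from collections import defaultdict


def balance_by_category(articles, per_category, seed=0):
    """Return a class-balanced subset of (label, text) pairs.

    Hypothetical helper: random downsampling is an assumed strategy,
    not necessarily the one used to build the corpus.
    """
    # Group article texts by their category label.
    by_label = defaultdict(list)
    for label, text in articles:
        by_label[label].append(text)

    rng = random.Random(seed)  # fixed seed for a reproducible subset
    balanced = []
    for label in sorted(by_label):
        texts = by_label[label]
        if len(texts) < per_category:
            raise ValueError(
                f"category {label!r} has only {len(texts)} articles, "
                f"need {per_category}"
            )
        # Sample without replacement so no article is duplicated.
        for text in rng.sample(texts, per_category):
            balanced.append((label, text))
    return balanced
```

With 8 categories and `per_category=90000`, such a procedure yields 8 × 90,000 = 720,000 articles, which matches the size of the balanced version described above.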




