Arabic text categorization, Deep learning, Word embedding, Arabic corpus
Computer Sciences | Data Science | Science and Technology Studies
Over the last years, Natural Language Processing (NLP) for Arabic language has obtained increasing importance due to the massive textual information available online in an unstructured text format, and its capability in facilitating and making information retrieval easier. One of the widely used NLP task is “Text Classification”. Its goal is to employ machine learning technics to automatically classify the text documents into one or more predefined categories. An important step in machine learning is to find suitable and large data for training and testing an algorithm. Moreover, Deep Learning (DL), the trending machine learning research, requires a lot of data and needs to be trained with several different and challenging datasets to perform to its best. Currently, there are few available corpora used in Arabic text categorization research. These corpora are small and some of them are unbalanced or contains redundant data. In this paper, a new voluminous Arabic corpus is proposed. This corpus is collected from 16 Arabic online news portals using an automated web crawling process. Two versions are available: the first is imbalanced and contains 3252934 articles distributed into 8 predefined categories. This version can be used to generate Arabic word embedding; the second is balanced and contains 720000 articles also distributed into 8 predefined categories with 90000 each. It can be used in Arabic text classification research. The corpus can be made available for research purpose upon request. Two experiments were conducted to show the impact of dataset size and the use of word2vec pre-trained word embedding on the performance of Arabic text classification using deep learning model.
Abou Khachfeh, Roua A.; El Kabani, Islam; and Osman, Ziad
"A NOVEL ARABIC CORPUS FOR TEXT CLASSIFICATION USING DEEP LEARNING AND WORD EMBEDDING,"
BAU Journal - Science and Technology: Vol. 3
, Article 10.
Available at: https://digitalcommons.bau.edu.lb/stjournal/vol3/iss1/10