An Ensemble Multi-label Themes-Based Classification for Holy Qur'an Verses Using Word2Vec Embedding

Cited by: 4
Authors
Mohamed, Ensaf Hussein [1]
El-Behaidy, Wessam H. [1]
Affiliation
[1] Helwan Univ, Fac Comp & Artificial Intelligence, Cairo, Egypt
Keywords
Multi-label classification; Holy Quran; Arabic NLP; Machine learning; Word2vec; TF-IDF;
DOI
10.1007/s13369-020-05184-0
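Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences];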
Discipline classification code
07; 0710; 09
Abstract
Automatic theme-based classification of Quran verses is the process of assigning verses to predefined categories, or themes. It is an essential task for Muslims and for anyone interested in studying the Quran. Theme-based classification of the Quran can support many natural language processing (NLP) applications, such as search engines, data mining, question-answering systems, and information retrieval. This paper presents an ensemble multi-label classification model that automatically identifies and classifies Quran verses by theme/topic. The model is composed of four phases: pre-processing, data vectorization, binary relevance classification, and a voting module. First, the verses of the second chapter of the Quran (Al-Baqarah) are tokenized and normalized, and their topics are manually labeled according to the "Mushaf Al-Tajweed" classification. Second, the verses are converted into feature vectors using term frequency-inverse document frequency (TF-IDF) and word2vec. Word2vec is used to capture the semantic meaning of Quranic words and to improve performance; the embeddings are trained on a collected classical Arabic corpus of 200 million words. Third, the binary relevance multi-label classification technique is applied with three different classifiers (logistic regression, support vector machine, and random forest), which categorize verses into 393 topics/tags. Finally, the voting module picks the tags with the maximum prediction probability among the three classifiers. The results of the three classifiers and of the ensemble model are compared against "Mushaf Al-Tajweed." The ensemble model outperforms each of the three classifiers: its average Hamming loss, recall, precision, and F1-score are 0.224, 81%, 75%, and 77%, respectively.
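The voting step and the Hamming loss metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how ties or thresholds are handled, so the code below assumes that, for each verse and tag, the highest probability across the three classifiers is kept and the tag is assigned when that probability reaches a hypothetical threshold of 0.5. The function names, the threshold, and the toy probabilities are all illustrative assumptions.

```python
from typing import Dict, List

# One row per verse, one column per tag (hypothetical layout).
Matrix = List[List[float]]
Labels = List[List[int]]


def vote_max_probability(per_model: Dict[str, Matrix],
                         threshold: float = 0.5) -> Labels:
    """For each verse and tag, keep the highest probability reported by
    any classifier, then assign the tag if it reaches `threshold`.
    The 0.5 threshold is an assumption, not a value from the paper."""
    models = list(per_model.values())
    n_verses = len(models[0])
    n_tags = len(models[0][0])
    predictions: Labels = []
    for i in range(n_verses):
        row = []
        for j in range(n_tags):
            best = max(model[i][j] for model in models)
            row.append(1 if best >= threshold else 0)
        predictions.append(row)
    return predictions


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of (verse, tag) cells where prediction and truth differ."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total


# Toy probabilities for 2 verses and 3 tags (made-up numbers).
per_model = {
    "logistic_regression": [[0.9, 0.2, 0.4], [0.1, 0.6, 0.3]],
    "svm":                 [[0.7, 0.1, 0.55], [0.2, 0.4, 0.2]],
    "random_forest":       [[0.8, 0.3, 0.3], [0.05, 0.7, 0.45]],
}
y_true = [[1, 0, 0], [0, 1, 0]]

ensemble = vote_max_probability(per_model)
loss = hamming_loss(y_true, ensemble)
print(ensemble, round(loss, 3))
```

In this toy example one of the six verse-tag cells is wrong, so the Hamming loss is 1/6. Under binary relevance each of the 393 tags would have its own column, with each classifier trained independently per tag.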
Pages: 3519-3529
Number of pages: 11
Published in: Arabian Journal for Science and Engineering, 2021, vol. 46