An Ensemble Multi-label Themes-Based Classification for Holy Qur'an Verses Using Word2Vec Embedding

Cited by: 4
Authors
Mohamed, Ensaf Hussein [1]
El-Behaidy, Wessam H. [1]
Affiliations
[1] Helwan Univ, Fac Comp & Artificial Intelligence, Cairo, Egypt
Keywords
Multi-label classification; Holy Quran; Arabic NLP; Machine learning; Word2vec; TF-IDF;
DOI
10.1007/s13369-020-05184-0
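Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification codes
07; 0710; 09
Abstract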
Automatic themes-based classification of Quran verses is the process of assigning verses to predefined categories or themes. It is an essential task for all Muslims and people interested in studying the Quran. Quran themes-based classification could be used in many natural language processing (NLP) applications, such as search engines, data mining, question-answering systems, and information retrieval. This paper presents an ensemble multi-label classification model that automatically identifies and classifies Quran verses based on themes/topics. The model is composed of four phases: pre-processing, data vectorization, binary relevance classification, and a voting module. First, the verses of the second chapter of the Quran (Al-Baqarah) are tokenized and normalized, and their topics are manually labeled based on the "Mushaf Al-Tajweed" classification. Second, verses are converted into feature vectors using term frequency-inverse document frequency (TF-IDF) and word2vec techniques. Word2vec is used to capture the semantic meaning of Quranic words and to improve performance; the word2vec vectors are trained on a collected classical Arabic corpus of 200 million words. Third, the binary relevance multi-label classification technique is applied using three different classifiers (logistic regression, support vector machine, and random forest), which categorize verses into 393 topics/tags. Finally, the voting module picks the tags with the maximum prediction probability among the three classifiers. The results of the three classifiers and the ensemble model are compared against "Mushaf Al-Tajweed." The ensemble model outperforms the three individual classifiers: its average Hamming loss, recall, precision, and F1-score are 0.224, 81%, 75%, and 77%, respectively.
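The voting step and the Hamming loss metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the toy probabilities, and in particular the 0.5 decision threshold are assumptions made for illustration, since the abstract only says that tags are picked by maximum prediction probability among the three classifiers.

```python
from typing import List

Matrix = List[List[float]]  # rows = verses, columns = tags


def max_probability_vote(prob_matrices: List[Matrix],
                         threshold: float = 0.5) -> List[List[int]]:
    """For each verse and tag, keep the highest probability reported by
    any classifier, then assign the tag if it reaches the threshold.
    The threshold value is an assumption, not taken from the paper."""
    n_verses = len(prob_matrices[0])
    n_tags = len(prob_matrices[0][0])
    predictions = []
    for i in range(n_verses):
        row = []
        for j in range(n_tags):
            best = max(m[i][j] for m in prob_matrices)
            row.append(1 if best >= threshold else 0)
        predictions.append(row)
    return predictions


def hamming_loss(y_true: List[List[int]], y_pred: List[List[int]]) -> float:
    """Fraction of (verse, tag) cells where prediction and truth disagree."""
    total = sum(len(row) for row in y_true)
    wrong = sum(t != p
                for true_row, pred_row in zip(y_true, y_pred)
                for t, p in zip(true_row, pred_row))
    return wrong / total


# Hypothetical per-tag probabilities for 2 verses and 3 tags, standing in
# for the outputs of the three binary relevance classifiers.
lr_probs = [[0.9, 0.2, 0.4], [0.1, 0.6, 0.3]]   # logistic regression
svm_probs = [[0.7, 0.1, 0.6], [0.2, 0.4, 0.2]]  # support vector machine
rf_probs = [[0.8, 0.3, 0.3], [0.3, 0.5, 0.1]]   # random forest

y_true = [[1, 0, 0], [0, 1, 0]]
y_pred = max_probability_vote([lr_probs, svm_probs, rf_probs])
loss = hamming_loss(y_true, y_pred)
```

On this toy input the ensemble assigns tags 1 and 3 to the first verse and tag 2 to the second, so one of the six verse-tag cells is wrong. Taking the per-tag maximum favors recall over precision, which is at least consistent with the reported recall (81%) exceeding precision (75%), though the abstract does not state that this is the cause.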
Pages: 3519 - 3529
Number of pages: 11
Related papers
50 in total
  • [21] Impact of preprocessing and word embedding on extreme multi-label patent classification tasks
    Guik Jung
    Junghoon Shin
    Sangjun Lee
    Applied Intelligence, 2023, 53 : 4047 - 4062
  • [22] A Multi-label Classification of Disaster-Related Tweets with Enhanced Word Embedding Ensemble Convolutional Neural Network Model
    Arathi E.
    Sasikala S.
    Informatica (Slovenia), 2022, 46 (07): : 131 - 144
  • [23] Multi-label text classification model based on semantic embedding
    Yan Danfeng
    Ke Nan
    Gu Chao
    Cui Jianfei
    Ding Yiqi
    The Journal of China Universities of Posts and Telecommunications, 2019, 26 (01) : 95 - 104
  • [24] Multi-label Classification of Small Samples Using an Ensemble Technique
    Mahdavi-Shahri, Amirreza
    Karimian, Jamil
    Javadi, Azadeh
    Houshmand, Mahboobeh
    26TH IRANIAN CONFERENCE ON ELECTRICAL ENGINEERING (ICEE 2018), 2018, : 1708 - 1713
  • [25] An Efficient Multi-Label Classification System Using Ensemble of Classifiers
    Chandran, Shilpa A.
    Panicker, Janu R.
    2017 INTERNATIONAL CONFERENCE ON INTELLIGENT COMPUTING, INSTRUMENTATION AND CONTROL TECHNOLOGIES (ICICICT), 2017, : 1133 - 1136
  • [26] Multilabel classification using heterogeneous ensemble of multi-label classifiers
    Tahir, Muhammad Atif
    Kittler, Josef
    Bouridane, Ahmed
    PATTERN RECOGNITION LETTERS, 2012, 33 (05) : 513 - 523
  • [27] Classification Bullying Tweet Using Convolutional Neural Network with Word2vec
    Ricko
    Sasongko, Priyo Sidik
    2021 5TH INTERNATIONAL CONFERENCE ON INFORMATICS AND COMPUTATIONAL SCIENCES (ICICOS 2021), 2021,
  • [28] Text classification model based on Word2vec and SF-HAN
    Li, Zhien
    Rao, Zhuyi
    PROCEEDINGS OF 2020 IEEE 5TH INFORMATION TECHNOLOGY AND MECHATRONICS ENGINEERING CONFERENCE (ITOEC 2020), 2020, : 1385 - 1390
  • [29] Multi-label classification of legal text based on label embedding and capsule network
    Chen, Zhe
    Li, Shang
    Ye, Lin
    Zhang, Hongli
    APPLIED INTELLIGENCE, 2023, 53 (06) : 6873 - 6886