An Ensemble Multi-label Themes-Based Classification for Holy Qur’an Verses Using Word2Vec Embedding

Authors
Ensaf Hussein Mohamed
Wessam H. El-Behaidy
Affiliation
[1] Helwan University, Faculty of Computers and Artificial Intelligence
Keywords
Multi-label classification; Holy Quran; Arabic NLP; Machine learning; Word2vec; TF-IDF
Abstract
Automatic theme-based classification of Quran verses is the process of assigning verses to predefined categories or themes. It is an essential task for Muslims and for anyone interested in studying the Quran. Theme-based classification of the Quran could be used in many natural language processing (NLP) applications, such as search engines, data mining, question-answering systems, and information retrieval. This paper presents an ensemble multi-label classification model that automatically identifies and classifies Quran verses by theme or topic. The model is composed of four phases: pre-processing, data vectorization, binary relevance classification, and a voting module. First, the verses of the second chapter of the Quran (Al-Baqarah) are tokenized and normalized, and their topics are manually labeled based on the “Mushaf Al-Tajweed” classification. Second, the verses are converted into feature vectors using term frequency-inverse document frequency (TF-IDF) and word2vec. Word2vec is used to capture the semantic meaning of Quranic words and to improve performance; the embeddings are trained on a collected classical Arabic corpus of 200 million words. Third, the binary relevance multi-label classification technique is applied using three different classifiers (logistic regression, support vector machine, and random forest), which categorize verses into 393 topics/tags. Finally, the voting module picks the tags with the maximum prediction probability among the three classifiers. The results of the three classifiers and of the ensemble model are compared against “Mushaf Al-Tajweed.” The ensemble model outperforms each of the three individual classifiers: its average Hamming loss, recall, precision, and F1-score are 0.224, 81%, 75%, and 77%, respectively.
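The voting step and the Hamming loss metric named in the abstract can be illustrated with a small sketch. This is one possible reading, not the authors' implementation: it assumes each classifier returns a probability per tag for a verse, that the vote keeps the highest probability per tag across classifiers, and that a tag is assigned when that probability reaches a threshold of 0.5. The function names, the threshold, the placeholder tag names, and the sample probabilities are all hypothetical.

```python
from typing import Dict, List, Set


def vote_max_probability(
    per_classifier_probs: List[Dict[str, float]],
    threshold: float = 0.5,  # assumed cut-off; not stated in the abstract
) -> Dict[str, float]:
    """For each tag, keep the highest probability any classifier gave it,
    then retain only the tags whose best probability reaches the threshold."""
    all_tags: Set[str] = set()
    for probs in per_classifier_probs:
        all_tags.update(probs)

    selected: Dict[str, float] = {}
    for tag in all_tags:
        best = max(probs.get(tag, 0.0) for probs in per_classifier_probs)
        if best >= threshold:
            selected[tag] = best
    return selected


def hamming_loss(
    true_sets: List[Set[str]],
    pred_sets: List[Set[str]],
    all_tags: List[str],
) -> float:
    """Fraction of (verse, tag) decisions that are wrong: a tag that is
    predicted but not true, or true but not predicted."""
    errors = sum(len(t ^ p) for t, p in zip(true_sets, pred_sets))
    return errors / (len(true_sets) * len(all_tags))


if __name__ == "__main__":
    # Hypothetical probabilities for a single verse from the three classifiers.
    logistic_regression = {"tag_a": 0.62, "tag_b": 0.10, "tag_c": 0.40}
    support_vector_machine = {"tag_a": 0.55, "tag_b": 0.20, "tag_c": 0.71}
    random_forest = {"tag_a": 0.30, "tag_b": 0.05, "tag_c": 0.45}

    voted = vote_max_probability(
        [logistic_regression, support_vector_machine, random_forest]
    )
    print(sorted(voted))  # tags kept by the vote

    loss = hamming_loss(
        true_sets=[{"tag_a", "tag_c"}],
        pred_sets=[{"tag_a", "tag_b"}],
        all_tags=["tag_a", "tag_b", "tag_c"],
    )
    print(round(loss, 3))
```

Under binary relevance each tag is decided independently, so taking the maximum per tag lets a single confident classifier assign a tag. Under this reading, that would tend to favor recall over precision.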
Pages: 3519-3529
Page count: 10