Multi-label Arabic text categorization: A benchmark and baseline comparison of multi-label learning algorithms

Cited by: 39
Authors
Al-Salemi, Bassam [1]
Ayob, Masri [1]
Kendall, Graham [2]
Noah, Shahrul Azman Mohd [1]
Affiliations
[1] Univ Kebangsaan Malaysia, Fac Informat Sci & Technol, Bangi, Selangor, Malaysia
[2] Univ Nottingham, Sch Comp Sci, Nottingham, England
Keywords
Multi-label learning; Arabic text categorization; RTAnews; Multi-label benchmark; Boosting algorithms; Feature selection; Ranking
DOI
10.1016/j.ipm.2018.09.008
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Multi-label text categorization is the problem of assigning each document to a subset of categories by means of multi-label learning algorithms. Unlike English and most other languages, Arabic lacks available benchmark datasets, which prevents the evaluation of multi-label learning algorithms for Arabic text categorization. As a result, only a few recent studies have dealt with multi-label Arabic text categorization, and they relied on non-benchmark, inaccessible datasets. This work therefore aims to promote multi-label Arabic text categorization by (a) introducing "RTAnews", a new benchmark dataset of multi-label Arabic news articles for text categorization and other supervised learning tasks, which is publicly available in several formats compatible with existing multi-label learning tools such as MEKA and Mulan; and (b) conducting an extensive comparison of most of the well-known multi-label learning algorithms for Arabic text categorization, in order to establish baseline results and show the effectiveness of these algorithms on RTAnews. The evaluation involves four transformation-based algorithms (Binary Relevance, Classifier Chains, Calibrated Ranking by Pairwise Comparison and Label Powerset) combined with three base learners (Support Vector Machine, k-Nearest Neighbors and Random Forest), and four adaptation-based algorithms (Multi-label kNN, Instance-Based Learning by Logistic Regression Multi-label, Binary Relevance kNN and RFBoost). The reported baseline results show that RFBoost and Label Powerset with Support Vector Machine as the base learner outperformed the other compared algorithms. The results also show that adaptation-based algorithms are faster than transformation-based algorithms.
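The transformation-based algorithms named in the abstract reduce a multi-label problem to one or more single-label problems before a base learner is trained. The following is a minimal sketch of that transformation step for two of them, Binary Relevance and Label Powerset. It is not the authors' code: the toy documents, the label names and the function names `binary_relevance` and `label_powerset` are assumptions made for illustration, no data from RTAnews is used, and no base learner is trained.

```python
from collections import Counter

# Toy corpus of (document id, label set) pairs.
# Ids and labels are invented for illustration; they are not from RTAnews.
docs = [
    ("d1", {"politics", "economy"}),
    ("d2", {"sports"}),
    ("d3", {"politics"}),
    ("d4", {"economy", "politics"}),
]


def binary_relevance(dataset):
    """Binary Relevance: build one independent binary problem per label."""
    labels = sorted(set().union(*(label_set for _, label_set in dataset)))
    return {
        label: [(doc, int(label in label_set)) for doc, label_set in dataset]
        for label in labels
    }


def label_powerset(dataset):
    """Label Powerset: treat each distinct label set as one multi-class target."""
    return [(doc, tuple(sorted(label_set))) for doc, label_set in dataset]


br = binary_relevance(docs)      # dict: label -> list of (doc, 0 or 1)
lp = label_powerset(docs)        # list of (doc, label-set class)
lp_classes = Counter(cls for _, cls in lp)  # how often each class occurs
```

In this sketch Binary Relevance yields one binary target per label and ignores label co-occurrence, whereas Label Powerset keeps co-occurrence but creates one class per distinct label combination. A base learner such as a Support Vector Machine would then be trained on each transformed problem.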
Pages: 212-227
Page count: 16
    [J]. 2021 12TH INTERNATIONAL CONFERENCE ON INFORMATION AND COMMUNICATION SYSTEMS (ICICS), 2021, : 475 - 477