Modified frequency-based term weighting schemes for text classification

Cited by: 74
Authors
Sabbah, Thabit [1, 2, 3]
Selamat, Ali [2, 3, 7]
Selamat, Md Hafiz [2]
Al-Anzi, Fawaz S. [4]
Viedma, Enrique Herrera [5, 6]
Krejcar, Ondrej [7]
Fujita, Hamido [8]
Affiliations
[1] Al Quds Open Univ QOU, Fac Technol & Appl Sci, POB 1804, Rammallah, Palestine
[2] Univ Teknol Malaysia, Fac Comp, Utm Johor Bahru 81310, Johor, Malaysia
[3] Univ Teknol Malaysia, UTM IRDA Ctr Excellence, Utm Johor Bahru 81310, Johor, Malaysia
[4] Kuwait Univ, Comp Engn Dept, POB 5969, Safat 13060, Kuwait
[5] Univ Granada, Dept Comp Sci & Artificial Intelligence, Granada, Spain
[6] King Abdulaziz Univ, Dept Elect & Comp Engn, Fac Engn, Jeddah 21589, Saudi Arabia
[7] Univ Hradec Kralove, FIM, Ctr Basic & Appl Res, Rokitanskeho 62, Hradec Kralove 50003, Czech Republic
[8] Iwate Prefectural Univ, 152-52 Sugo, Takizawa, Iwate 0200193, Japan
Keywords
Term-weighting; Missing features; Absent terms; Vector Space Model; Text classification; EXTREME LEARNING-MACHINE; SUPPORT VECTOR MACHINE; FEATURE-SELECTION; NEURAL-NETWORKS; CATEGORIZATION; WEB; ALGORITHM; SYSTEM
DOI
10.1016/j.asoc.2017.04.069
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
With the rapid growth of textual content on the Internet, automatic text categorization is a comparatively effective solution for information organization and knowledge management. Feature selection, one of the basic phases in statistical text categorization, depends crucially on the term weighting method. To improve the performance of text categorization, this paper proposes four modified frequency-based term weighting schemes, namely mTF, mTFIDF, TFmIDF, and mTFmIDF. The proposed schemes take the number of missing terms into account when calculating the weight of existing terms. They show the highest performance with an SVM classifier, reaching a micro-average F1 value of 97%. Moreover, benchmarking results on the Reuters-21578, 20Newsgroups, and WebKB text-classification datasets, using different classification algorithms such as SVM and KNN, show that the proposed schemes mTF, mTFIDF, and mTFmIDF outperform other weighting schemes such as TF, TFIDF, and Entropy. Additionally, statistical significance tests show a significant improvement in classification performance with the modified schemes. (C) 2017 Elsevier B.V. All rights reserved.
Pages: 193-206
Page count: 14