Ensemble of keyword extraction methods and classifiers in text classification

Cited by: 468
Authors
Onan, Aytug [1]
Korukoglu, Serdar [2]
Bulut, Hasan [2]
Affiliations
[1] Celal Bayar Univ, Dept Comp Engn, TR-45140 Muradiye, Manisa, Turkey
[2] Ege Univ, Dept Comp Engn, TR-35100 Izmir, Turkey
Keywords
Keyword extraction; Text classification; Ensemble learning; Scientific text classification; Automatic extraction; Keyphrases
DOI
10.1016/j.eswa.2016.03.045
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction enables us to represent text documents in a condensed way. This compact representation can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. For instance, text classification is a domain challenged by a high-dimensional feature space. Hence, extracting the words most relevant to the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most frequent measure based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). The study comprehensively compares base learning algorithms (Naive Bayes, support vector machines, logistic regression and Random Forest) with five widely utilized ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting). To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under curve values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that the Bagging ensemble of Random Forest with the most-frequent based keyword extraction method yields promising results for text classification.
For the ACM document collection, the highest average predictive performance (93.80%) is obtained by combining the most frequent based keyword extraction method with the Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that using a keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification. (C) 2016 Elsevier Ltd. All rights reserved.
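The two ideas the abstract combines, a keyword-based document representation and an ensemble decision, can be illustrated with a minimal sketch. It is not the authors' implementation: the stopword list, the tokenizer, the tie-breaking rule and all function names are assumptions made for illustration, and only the simplest of the five extraction methods (most frequent terms) and the simplest ensemble rule (majority voting) are shown.

```python
from collections import Counter
import re

# Tiny illustrative stopword list (an assumption, not the list used in the paper).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def most_frequent_keywords(text, k=5):
    """Return the k most frequent non-stopword terms of a document."""
    counts = Counter(t for t in tokenize(text) if t not in STOPWORDS)
    # Sort by descending count, then alphabetically, so ties are deterministic.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ranked[:k]]


def keyword_features(text, vocabulary):
    """Represent a document as term counts over a fixed keyword vocabulary."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


def majority_vote(predictions):
    """Combine the class labels several classifiers predicted for one document."""
    ranked = sorted(Counter(predictions).items(), key=lambda item: (-item[1], item[0]))
    return ranked[0][0]


doc = ("Keyword extraction supports text classification. "
       "Keyword features shrink the text feature space.")
keywords = most_frequent_keywords(doc, k=3)
features = keyword_features(doc, keywords)
label = majority_vote(["ai", "databases", "ai"])  # hypothetical classifier outputs
```

Restricting the feature vector to extracted keywords is what keeps the dimensionality low; in a real pipeline the keyword vocabulary would be built from the training collection, and the votes would come from trained base learners such as those named in the abstract.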
Pages: 232-247
Number of pages: 16
Related papers
50 in total
  • [1] A balanced ensemble approach to weighting classifiers for text classification
    Fung, Gabriel Pui Cheong
    Yu, Jeffrey Xu
    Wang, Haixun
    Cheung, David W.
    Liu, Huan
    [J]. ICDM 2006: SIXTH INTERNATIONAL CONFERENCE ON DATA MINING, PROCEEDINGS, 2006, : 869 - 873
  • [2] Another Perspective on Ensemble Methods for Automatic Keyword Extraction
    Lucci, Stephen
    Cox, James L.
    Pay, Tayfun
    [J]. 2018 IEEE INTERNATIONAL CONFERENCE ON BIG DATA (BIG DATA), 2018, : 5424 - 5426
  • [3] Analysis of keyword extraction methods for legal document classification
    Marinato, Matheus S.
    Santana, Ewaldo E. C.
    Jacob Jr, Antonio F. L.
    [J]. REVISTA BRASILEIRA DE COMPUTACAO APLICADA, 2024, 16 (02): : 88 - 96
  • [4] Automatic keyword extraction based on imbalanced classification methods
    Jiang, Weidong
    Hui, Xiaofeng
    [J]. Journal of Computational Information Systems, 2013, 9 (21): : 8483 - 8490
  • [5] Averaging and Boosting Methods in Ensemble-Based Classifiers for Text Readability
    Korniichuk, Ruslan
    Boryczka, Mariusz
    [J]. KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 3677 - 3685
  • [6] Incorporating keyword extraction and attention for multi-label text classification
    Zhao, Hua
    Li, Xiaoqian
    Wang, Fengling
    Zeng, Qingtian
    Diao, Xiuli
    [J]. JOURNAL OF INTELLIGENT & FUZZY SYSTEMS, 2023, 45 (02) : 2083 - 2093
  • [7] Keyword extraction for text categorization
    An, JY
    Chen, YPP
    [J]. PROCEEDINGS OF THE 2005 INTERNATIONAL CONFERENCE ON ACTIVE MEDIA TECHNOLOGY (AMT 2005), 2005, : 556 - 561
  • [8] Building an Ensemble of Fine-Tuned Naive Bayesian Classifiers for Text Classification
    El Hindi, Khalil
    AlSalman, Hussien
    Qasem, Safwan
    Al Ahmadi, Saad
    [J]. ENTROPY, 2018, 20 (11)
  • [9] WEC: Weighted Ensemble of Text Classifiers
    Upadhyay, Ashish
    Tien Thanh Nguyen
    Massie, Stewart
    McCall, John
    [J]. 2020 IEEE CONGRESS ON EVOLUTIONARY COMPUTATION (CEC), 2020,
  • [10] A Novel Graph-Based Ensemble Token Classification Model for Keyword Extraction
    Hüma Kılıç
    Aydın Çetin
    [J]. Arabian Journal for Science and Engineering, 2023, 48 : 10673 - 10680