Ensemble of keyword extraction methods and classifiers in text classification

Cited by: 468

Authors:

Onan, Aytug [1]

Korukoglu, Serdar [2]

Bulut, Hasan [2]

Affiliations:

[1] Celal Bayar Univ, Dept Comp Engn, TR-45140 Muradiye, Manisa, Turkey

[2] Ege Univ, Dept Comp Engn, TR-35100 Izmir, Turkey

Source:

EXPERT SYSTEMS WITH APPLICATIONS | 2016 / Volume 57

Keywords:

Keyword extraction; Text classification; Ensemble learning; Scientific text classification; AUTOMATIC EXTRACTION; KEYPHRASES

DOI:

10.1016/j.eswa.2016.03.045

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Automatic keyword extraction is an important research direction in text mining, natural language processing and information retrieval. Keyword extraction makes it possible to represent text documents in a condensed way. This compact representation can be helpful in several applications, such as automatic indexing, automatic summarization, automatic classification, clustering and filtering. Text classification, for instance, is a domain challenged by a high-dimensional feature space. Hence, extracting the words most relevant to the content of a document and using these keywords as features can be extremely useful. In this regard, this study examines the predictive performance of five statistical keyword extraction methods (most-frequent-measure-based keyword extraction, term frequency-inverse sentence frequency based keyword extraction, co-occurrence statistical information based keyword extraction, eccentricity-based keyword extraction and the TextRank algorithm) on classification algorithms and ensemble methods for scientific text document classification (categorization). The study presents a comprehensive comparison of base learning algorithms (Naive Bayes, support vector machines, logistic regression and Random Forest) with five widely used ensemble methods (AdaBoost, Bagging, Dagging, Random Subspace and Majority Voting). To the best of our knowledge, this is the first empirical analysis that evaluates the effectiveness of statistical keyword extraction methods in conjunction with ensemble learning algorithms. The classification schemes are compared in terms of classification accuracy, F-measure and area under the curve (AUC) values. To validate the empirical analysis, a two-way ANOVA test is employed. The experimental analysis indicates that the Bagging ensemble of Random Forest with the most-frequent-based keyword extraction method yields promising results for text classification. For the ACM document collection, the highest average predictive performance (93.80%) is obtained by using the most-frequent-based keyword extraction method with the Bagging ensemble of the Random Forest algorithm. In general, Bagging and Random Subspace ensembles of Random Forest yield promising results. The empirical analysis indicates that using a keyword-based representation of text documents in conjunction with ensemble learning can enhance the predictive performance and scalability of text classification schemes, which is of practical importance in the application fields of text classification. (C) 2016 Elsevier Ltd. All rights reserved.
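To make the idea of a keyword-based document representation more concrete, the following is a minimal Python sketch of two of the statistical extractors named in the abstract: most-frequent-based extraction and term frequency-inverse sentence frequency (TF-ISF). It is not the authors' implementation. The abstract does not give the scoring formulas, so the TF-ISF form used here (term frequency times the logarithm of the sentence count divided by the number of sentences containing the term) is an assumption, and the stop-word list, tokenizer, and all function names are hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list (illustration only).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "for", "on", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens that are not stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


def rank(scores, k):
    """Return the k best terms: highest score first, ties broken alphabetically."""
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [term for term, _ in ordered[:k]]


def most_frequent_keywords(text, k=3):
    """Most-frequent-based extraction: score each term by its raw count."""
    return rank(Counter(tokenize(text)), k)


def tf_isf_keywords(text, k=3):
    """TF-ISF extraction under the assumed form tf(t) * log(N / sf(t)).

    N is the number of non-empty sentences and sf(t) is the number of
    sentences that contain the term t.
    """
    sentences = [tokenize(s) for s in re.split(r"[.!?]+", text)]
    sentences = [s for s in sentences if s]
    n_sentences = len(sentences)
    tf = Counter(t for s in sentences for t in s)
    sf = Counter(t for s in sentences for t in set(s))
    scores = {t: tf[t] * math.log(n_sentences / sf[t]) for t in tf}
    return rank(scores, k)


def keyword_features(text, vocabulary):
    """Binary feature vector: 1 if the keyword occurs in the text, else 0."""
    present = set(tokenize(text))
    return [1 if term in present else 0 for term in vocabulary]
```

In a pipeline of the kind the abstract describes, the keywords extracted from the training documents would form the vocabulary, and vectors such as those returned by `keyword_features` would be passed to a base learner or an ensemble. The classifier stage is omitted here because the abstract does not specify its configuration.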


Pages: 232-247

Page count: 16

