Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training

Cited by: 63
Authors
Nasir, Inzamam Mashood [1]
Khan, Muhammad Attique [1]
Yasmin, Mussarat [2]
Shah, Jamal Hussain [2]
Gabryel, Marcin [3]
Scherer, Rafal [3]
Damasevicius, Robertas [4]
Affiliations
[1] HITEC Univ, Dept Comp Sci, Taxila 47080, Pakistan
[2] COMSATS Univ Islamabad, Dept Comp Sci, Wah Campus, Wah Cantonment 47040, Pakistan
[3] Czestochowa Tech Univ, Dept Intelligent Comp Syst, PL-42200 Czestochowa, Poland
[4] Silesian Tech Univ, Fac Appl Math, PL-44100 Gliwice, Poland
Keywords
document classification; deep learning; feature selection; data augmentation; imbalanced dataset
DOI
10.3390/s20236793
Chinese Library Classification (CLC) number
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Many organizations store their documents in digital form. Printing this volume of data and filing it in folders, rather than storing it digitally, is impractical, uneconomical, and ecologically unsound. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on a deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logos, and handwritten notes. The major steps of the proposed technique are data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique, which normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and the fused feature vector is optimized by applying a Pearson correlation coefficient-based technique that selects the optimized features while removing the redundant ones. The proposed technique is tested on the Tobacco3482 dataset, where it achieves a classification accuracy of 93.1% using a cubic support vector machine classifier, supporting the validity of the proposed technique.
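The fusion and selection steps named in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact fusion rule or selection criterion, so everything here is an assumption: fusion is taken to be plain concatenation of the two feature vectors, and selection is a greedy filter that drops any feature whose absolute Pearson correlation with an already kept feature exceeds a hypothetical threshold of 0.9. The function names (`pearson`, `fuse`, `select_uncorrelated`) and the toy values are illustrative only and are not taken from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        # A constant column has no defined correlation; treat it as 0 here.
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def fuse(features_a, features_b):
    """Assumed fusion rule: concatenate the two feature vectors per sample."""
    return [list(a) + list(b) for a, b in zip(features_a, features_b)]


def select_uncorrelated(samples, threshold=0.9):
    """Greedy redundancy filter (an assumption, not the paper's exact rule).

    Walks the feature columns in order and keeps a column only if its
    absolute correlation with every previously kept column is <= threshold.
    Returns the indices of the kept columns.
    """
    columns = list(zip(*samples))
    kept = []
    for j, col in enumerate(columns):
        if all(abs(pearson(col, columns[k])) <= threshold for k in kept):
            kept.append(j)
    return kept


# Toy stand-ins for features from two networks (4 samples, 2 features each).
net_a = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
net_b = [[0.5, 3.0], [0.1, 6.0], [0.9, 9.0], [0.3, 12.0]]

fused = fuse(net_a, net_b)                      # 4 samples x 4 features
kept = select_uncorrelated(fused, threshold=0.9)
reduced = [[row[j] for j in kept] for row in fused]
print(kept)
```

In this toy input, columns 1 and 3 are exact multiples of column 0, so the filter discards them and keeps columns 0 and 2. The reduced vectors would then be passed to a classifier; the abstract reports using a cubic support vector machine for that step.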
Pages: 1 / 18
Page count: 18
Related papers
50 in total
  • [21] An algorithm acceleration framework for correlation-based feature selection
    Yan, Xuefeng
    Zhang, Yuqing
    Khan, Arif Ali
    [J]. 2020 2ND INTERNATIONAL CONFERENCE ON COMPUTER SCIENCE COMMUNICATION AND NETWORK SECURITY (CSCNS2020), 2021, 336
  • [22] Heuristically Reducing the Cost of Correlation-based Feature Selection
    Brown, Katherine E.
    Talbert, Douglas A.
    [J]. PROCEEDINGS OF THE 2019 ANNUAL ACM SOUTHEAST CONFERENCE (ACMSE 2019), 2019, : 24 - 30
  • [23] Correlation-based Gene Selection and Classification Using Taguchi-BPSO
    Chuang, L. -Y.
    Yang, C. -S.
    Wu, K. -C.
    Yang, C. -H.
    [J]. METHODS OF INFORMATION IN MEDICINE, 2010, 49 (03) : 254 - 268
  • [24] Feature selection with Fast Correlation-Based Filter for Breast cancer prediction and Classification using Machine Learning Algorithms
    Khourdifi, Youness
    Bahaj, Mohamed
    [J]. 2018 INTERNATIONAL SYMPOSIUM ON ADVANCED ELECTRICAL AND COMMUNICATION TECHNOLOGIES (ISAECT), 2018,
  • [25] Feature selection for document classification based on topology
    El Barbary, O. G.
    Salama, A. S.
    [J]. EGYPTIAN INFORMATICS JOURNAL, 2018, 19 (02) : 129 - 132
  • [26] Partial imputation to improve predictive modelling in insurance risk classification using a hybrid positive selection algorithm and correlation-based feature selection
    Duma, Mlungisi
    Twala, Bhekisipho
    Nelwamondo, Fulufhelo V.
    Marwala, Tshilidzi
    [J]. CURRENT SCIENCE, 2012, 103 (06) : 697 - 705
  • [27] Impact of Correlation-based Feature Selection on Photovoltaic Power Prediction
    Kwon, Jung-Hyok
    Lee, Sang-Woo
    Lee, Sol-Bee
    Kim, Eui-Jik
    [J]. 2019 4TH TECHNOLOGY INNOVATION MANAGEMENT AND ENGINEERING SCIENCE INTERNATIONAL CONFERENCE (TIMES-ICON), 2019,
  • [28] Correlation-Based Feature Selection to Identify Functional Dynamics in Proteins
    Diez, Georg
    Nagel, Daniel
    Stock, Gerhard
    [J]. JOURNAL OF CHEMICAL THEORY AND COMPUTATION, 2022, 18 (08) : 5079 - 5088
  • [29] Correlation-Based and Causal Feature Selection Analysis for Ensemble Classifiers
    Duangsoithong, Rakkrit
    Windeatt, Terry
    [J]. ARTIFICIAL NEURAL NETWORKS IN PATTERN RECOGNITION, PROCEEDINGS, 2010, 5998 : 25 - 36
  • [30] Feature Subset Selection: A Correlation-Based SVM Filter Approach
    Li, Boyang
    Wang, Qiangwei
    Hu, Jinglu
    [J]. IEEJ TRANSACTIONS ON ELECTRICAL AND ELECTRONIC ENGINEERING, 2011, 6 (02) : 173 - 179