Pearson Correlation-Based Feature Selection for Document Classification Using Balanced Training

Cited by: 63
Authors
Nasir, Inzamam Mashood [1]
Khan, Muhammad Attique [1]
Yasmin, Mussarat [2]
Shah, Jamal Hussain [2]
Gabryel, Marcin [3]
Scherer, Rafal [3]
Damasevicius, Robertas [4]
Affiliations
[1] HITEC Univ, Dept Comp Sci, Taxila 47080, Pakistan
[2] COMSATS Univ Islamabad, Dept Comp Sci, Wah Campus, Wah Cantonment 47040, Pakistan
[3] Czestochowa Tech Univ, Dept Intelligent Comp Syst, PL-42200 Czestochowa, Poland
[4] Silesian Tech Univ, Fac Appl Math, PL-44100 Gliwice, Poland
Keywords
document classification; deep learning; feature selection; data augmentation; imbalanced dataset
DOI
10.3390/s20236793
Chinese Library Classification
O65 [Analytical Chemistry]
Discipline classification codes
070302; 081704
Abstract
Many organizations store their documents in digital form. Printing this volume of data and filing it in folders instead of storing it digitally is impractical, uneconomical, and ecologically unsound. An efficient way of retrieving data from digitally stored documents is also required. This article presents a real-time supervised learning technique for document classification based on a deep convolutional neural network (DCNN), which aims to reduce the impact of adverse document image issues such as signatures, marks, logos, and handwritten notes. The major steps of the proposed technique are data augmentation, feature extraction using pre-trained neural network models, feature fusion, and feature selection. We propose a novel data augmentation technique that normalizes the imbalanced dataset using the secondary dataset RVL-CDIP. The DCNN features are extracted using the VGG19 and AlexNet networks. The extracted features are fused, and a Pearson correlation coefficient-based technique is then applied to the fused feature vector to select an optimized subset of features while removing redundant ones. Tested on the Tobacco3482 dataset, the proposed technique achieves a classification accuracy of 93.1% with a cubic support vector machine classifier, which supports its validity.
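The fusion and selection steps named in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact selection rule, so everything here is an assumption made for illustration: fusion is plain concatenation, a feature is dropped when its absolute Pearson correlation with an already kept feature reaches a hypothetical `threshold` of 0.9, and the function names `fuse_features` and `select_by_pearson` are invented. Random arrays stand in for the VGG19 and AlexNet features.

```python
import numpy as np


def fuse_features(a, b):
    """Fuse two feature matrices (n_samples x d1, n_samples x d2) by concatenation."""
    return np.concatenate([a, b], axis=1)


def select_by_pearson(features, threshold=0.9):
    """Return the column indices kept after removing redundant features.

    Assumed rule (not taken from the paper): walk the columns in order and
    keep a column only if its absolute Pearson correlation with every
    column kept so far is below `threshold`.
    """
    # Constant columns have an undefined correlation, so exclude them first.
    candidates = np.flatnonzero(features.std(axis=0) > 0)
    if candidates.size == 0:
        return candidates
    corr = np.abs(np.atleast_2d(np.corrcoef(features[:, candidates], rowvar=False)))
    kept = []
    for i in range(len(candidates)):
        if all(corr[i, j] < threshold for j in kept):
            kept.append(i)
    return candidates[kept]


# Stand-in data: 50 samples, 4 "VGG19" features and 2 "AlexNet" features.
rng = np.random.default_rng(0)
vgg_like = rng.normal(size=(50, 4))
# The first "AlexNet" column is a linear copy of VGG column 0, so it is redundant.
alex_like = np.column_stack([2.0 * vgg_like[:, 0] + 1.0, rng.normal(size=50)])

fused = fuse_features(vgg_like, alex_like)
kept = select_by_pearson(fused, threshold=0.9)
selected = fused[:, kept]
print(fused.shape, "->", selected.shape)
```

In a full pipeline the `selected` matrix would be passed to a classifier; the abstract reports a cubic support vector machine for that stage.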
Pages: 1 / 18
Number of pages: 18