MCNN-LSTM: Combining CNN and LSTM to Classify Multi-Class Text in Imbalanced News Data

Cited by: 11
Authors
Hasib, Khan Md [1]
Azam, Sami [2]
Karim, Asif [2]
Marouf, Ahmed Al [3]
Shamrat, F. M. Javed Mehedi [4]
Montaha, Sidratul [5]
Yeo, Kheng Cher [2]
Jonkman, Mirjam [2]
Alhajj, Reda [3, 6, 7]
Rokne, Jon G. [3]
Affiliations
[1] Bangladesh Univ Business & Technol, Dept Comp Sci & Engn, Dhaka 1216, Bangladesh
[2] Charles Darwin Univ, Fac Sci & Technol, Casuarina, NT 0810, Australia
[3] Univ Calgary, Dept Comp Sci, Calgary, AB T2N 1N4, Canada
[4] Univ Malaya, Dept Comp Syst & Technol, Kuala Lumpur 50603, Malaysia
[5] Daffodil Int Univ, Dept Comp Sci & Engn, Dhaka 1207, Bangladesh
[6] Istanbul Medipol Univ, Dept Comp Engn, TR-34810 Istanbul, Turkiye
[7] Univ Southern Denmark, Dept Health Informat, DK-5230 Odense, Denmark
Keywords
Big data; text classification; imbalanced data; machine learning; MCNN-LSTM; CLASSIFICATION; EMAILS
DOI
10.1109/ACCESS.2023.3309697
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Searching, retrieving, and arranging text in ever-larger document collections requires more efficient information processing algorithms. Document categorization is a crucial component of many information processing systems that rely on supervised learning. As the quantity of documents grows, the performance of classic supervised classifiers has deteriorated because of the number of document categories. Text classification is the task of assigning documents to a predetermined set of classes, and it is used extensively in a wide range of data-intensive applications. However, real-world implementations of these models are plagued with shortcomings that call for more investigation; in particular, imbalanced datasets hinder the most prevalent high-performance algorithms. In this paper, we propose an approach named multi-class Convolutional Neural Network (MCNN)-Long Short-Term Memory (LSTM), which combines two deep learning techniques, Convolutional Neural Network (CNN) and Long Short-Term Memory, for text classification in news data. CNNs are used as feature extractors for the LSTMs on text input data and capture the spatial structure of words in a sentence, paragraph, or document. Because the dataset is imbalanced, we use the Tomek-Link algorithm to balance it and then apply our model, which shows better performance in terms of F1-score (98%) and accuracy (99.71%) than existing works. The combination of deep learning techniques used in our approach is well suited to the classification of imbalanced datasets with underrepresented categories. Hence, our method outperformed other machine learning algorithms in text classification by a large margin. We also compare our results with traditional machine learning algorithms on both imbalanced and balanced datasets.
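The abstract names the Tomek-Link algorithm as its balancing step but gives no implementation details, so the following is only a minimal sketch of the general technique, not the authors' code. A Tomek link is a pair of samples from different classes that are each other's nearest neighbor; undersampling removes one member of each link. The assumptions here are that samples are already numeric feature vectors, that distance is Euclidean, and that the member from the more frequent class is the one dropped. The function names and the toy data are hypothetical.

```python
from collections import Counter
from math import dist


def nearest_neighbor(i, points):
    """Return the index of the point closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return pairs (i, j), i < j, that are mutual nearest neighbors with different labels."""
    nn = [nearest_neighbor(i, points) for i in range(len(points))]
    return [
        (i, j)
        for i, j in enumerate(nn)
        if i < j and nn[j] == i and labels[i] != labels[j]
    ]


def remove_tomek_majority(points, labels):
    """Drop, from every Tomek link, the member whose class is more frequent (assumed policy)."""
    counts = Counter(labels)
    drop = set()
    for i, j in tomek_links(points, labels):
        drop.add(i if counts[labels[i]] >= counts[labels[j]] else j)
    keep = [k for k in range(len(points)) if k not in drop]
    return [points[k] for k in keep], [labels[k] for k in keep]


# Hypothetical 2-D "document embeddings" with an over-represented class.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (1.0, 1.0), (1.05, 1.0), (3.0, 3.0)]
labels = ["sports", "sports", "sports", "sports", "politics", "politics"]

links = tomek_links(points, labels)
balanced_points, balanced_labels = remove_tomek_majority(points, labels)
print(links)
print(Counter(balanced_labels))
```

In this toy set the only cross-class mutual-nearest pair is the "sports" sample at (1.0, 1.0) and the "politics" sample at (1.05, 1.0), so the "sports" member is removed. The brute-force neighbor search is quadratic and only meant for illustration.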
Pages: 93048-93063
Number of pages: 16
Related papers
50 in total
  • [1] An Efficient Hybrid LSTM-CNN and CNN-LSTM with GloVe for Text Multi-class Sentiment Classification in Gender Violence
    Ismail, Abdul Azim
    Yusoff, Marina
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2022, 13 (09) : 853 - 863
  • [2] Optimizing Multi-Class Text Classification Models for Imbalanced News Data
    Anitha, S.
    Kavi Varshini, E.
    Haritha Mahalakshmi, N.
    Jishnu, S.
    [J]. 2024 15th International Conference on Computing Communication and Networking Technologies, ICCCNT 2024, 2024,
  • [3] Multi-class random forest model to classify wastewater treatment imbalanced data
    Distefano, Veronica
    Palma, Monica
    De Iaco, Sandra
    [J]. SOCIO-ECONOMIC PLANNING SCIENCES, 2024, 95
  • [4] Text classification of Chinese news based on multi-scale CNN and LSTM hybrid model
    Zhai, ZhengLi
    Zhang, Xin
    Fang, FeiFei
    Yao, LuYao
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2023, 82 (14) : 20975 - 20988
  • [5] Text classification of Chinese news based on multi-scale CNN and LSTM hybrid model
    ZhengLi Zhai
    Xin Zhang
    FeiFei Fang
    LuYao Yao
    [J]. Multimedia Tools and Applications, 2023, 82 : 20975 - 20988
  • [6] Multi-class Boosting for Imbalanced Data
    Fernandez-Baldera, Antonio
    Buenaposada, Jose M.
    Baumela, Luis
    [J]. PATTERN RECOGNITION AND IMAGE ANALYSIS (IBPRIA 2015), 2015, 9117 : 57 - 64
  • [7] News Text Classification Based on Improved Bi-LSTM-CNN
    Li, Chenbin
    Zhan, Guohua
    Li, Zhihua
    [J]. 2018 NINTH INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY IN MEDICINE AND EDUCATION (ITME 2018), 2018, : 890 - 893
  • [8] Multi-class WHMBoost: An ensemble algorithm for multi-class imbalanced data
    Zhao, Jiakun
    Jin, Ju
    Zhang, Yibo
    Zhang, Ruifeng
    Chen, Si
    [J]. INTELLIGENT DATA ANALYSIS, 2022, 26 (03) : 599 - 614
  • [9] Roman Urdu Headline News Text Classification Using RNN, LSTM and CNN
    Kandhro, Irfan Ali
    Jumani, Sahar Zafar
    Kumar, Kamlash
    Hafeez, Abdul
    Ali, Fayyaz
    [J]. ADVANCES IN DATA SCIENCE AND ADAPTIVE ANALYSIS, 2020, 12 (02)
  • [10] Evaluating Difficulty of Multi-class Imbalanced Data
    Lango, Mateusz
    Napierala, Krystyna
    Stefanowski, Jerzy
    [J]. FOUNDATIONS OF INTELLIGENT SYSTEMS, ISMIS 2017, 2017, 10352 : 312 - 322