Multilabel Over-sampling and Under-sampling with Class Alignment for Imbalanced Multilabel Text Classification

Cited by: 11
Authors
Taha, Adil Yaseen [1]
Tiun, Sabrina [1]
Abd Rahman, Abdul Hadi [1]
Sabah, Ali [1]
Affiliations
[1] Univ Kebangsaan Malaysia, Fac Informat Sci & Technol, Bangi, Selangor, Malaysia
Keywords
Data mining; multilabel text classification; class imbalance problem; resampling method; class alignment; FEATURE-SELECTION; DATA-SETS; LABEL; CATEGORIZATION; INSIGHT; SMOTE
DOI
10.32890/jict2021.20.3.6
Chinese Library Classification
TP [Automation and Computer Technology]
Discipline classification code
0812
Abstract
Multilabel text classification, the simultaneous assignment of multiple labels to a document, does not perform optimally when the classes are highly imbalanced. Class imbalance means that the underlying data distribution is skewed, which makes classification more difficult. Random over-sampling and under-sampling are common approaches to the class imbalance problem. However, both have drawbacks: under-sampling is likely to discard useful data, whereas over-sampling can increase the risk of overfitting. A new method that avoids both discarding useful data and overfitting is therefore needed. This study proposes a method that tackles the class imbalance problem by combining multilabel over-sampling and under-sampling with class alignment (ML-OUSCA). Instead of using all the training instances, ML-OUSCA draws a new training set by over-sampling small classes and under-sampling large classes. ML-OUSCA was evaluated using average precision, average recall, and average F-measure on three benchmark datasets: Reuters-21578, Bibtex, and Enron. Experimental results showed that ML-OUSCA outperformed the chosen baseline random resampling approaches, K-means SMOTE and KNN-US. Based on these results, it can be concluded that a resampling method that accounts for class imbalance together with class alignment improves multilabel classification more than random resampling alone.
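The general idea of drawing a new training set by over-sampling small classes and under-sampling large ones can be sketched as follows. This is a minimal illustration of plain random resampling on multilabel data, not the ML-OUSCA procedure: the abstract does not specify the class alignment step or the target class size, so the mean label count used as the target here, and the function names `label_counts` and `resample_indices`, are assumptions made for illustration only.

```python
import random
from collections import Counter


def label_counts(label_sets):
    """Count how many instances carry each label."""
    counts = Counter()
    for labels in label_sets:
        counts.update(labels)
    return counts


def resample_indices(label_sets, seed=0):
    """Return instance indices of a randomly resampled training set.

    Assumption: the per-label target is the mean label count. This is an
    illustrative choice, not the rule used by ML-OUSCA.
    """
    rng = random.Random(seed)
    current = label_counts(label_sets)
    target = sum(current.values()) / len(current)
    small = {lab for lab, c in current.items() if c < target}
    big = {lab for lab, c in current.items() if c > target}

    # Under-sampling: consider only instances whose labels are all "big",
    # and drop one only while each of its labels stays at or above target.
    candidates = [i for i, labels in enumerate(label_sets)
                  if labels and set(labels) <= big]
    rng.shuffle(candidates)
    dropped = set()
    for i in candidates:
        if all(current[lab] - 1 >= target for lab in label_sets[i]):
            dropped.add(i)
            for lab in label_sets[i]:
                current[lab] -= 1
    keep = [i for i in range(len(label_sets)) if i not in dropped]

    # Over-sampling: duplicate random instances of each "small" label until
    # the label reaches the target. A duplicated instance also raises the
    # counts of any other labels it carries.
    for lab in sorted(small):
        pool = [i for i in keep if lab in label_sets[i]]
        while current[lab] < target:
            i = rng.choice(pool)
            keep.append(i)
            for other in label_sets[i]:
                current[other] += 1
    return keep


# Toy data: label "a" is large (10 instances), label "b" is small (2).
data = [{"a"}] * 10 + [{"b"}] * 2
indices = resample_indices(data)
balanced = label_counts([data[i] for i in indices])
```

On this toy set both labels end up with the same number of instances. With real multilabel data, an instance often carries both frequent and rare labels, so exact balance is generally not reachable by duplication and removal alone.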
Pages: 423-456
Page count: 34
Related papers
50 in total
  • [21] Database-Text Alignment via Structured Multilabel Classification
    Snyder, Benjamin
    Barzilay, Regina
    [J]. 20TH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2007, : 1713 - 1718
  • [22] Cluster-based Under-sampling with Random Forest for Multi-Class Imbalanced Classification
    Arafat, Md. Yasir
    Hoque, Sabera
    Farid, Dewan Md.
    [J]. 2017 11TH INTERNATIONAL CONFERENCE ON SOFTWARE, KNOWLEDGE, INFORMATION MANAGEMENT AND APPLICATIONS (SKIMA), 2017,
  • [23] AMDO: An Over-Sampling Technique for Multi-Class Imbalanced Problems
    Yang, Xuebing
    Kuang, Qiuming
    Zhang, Wensheng
    Zhang, Guoping
    [J]. IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2018, 30 (09) : 1672 - 1685
  • [24] A self-adaptive synthetic over-sampling technique for imbalanced classification
    Gu, Xiaowei
    Angelov, Plamen P.
    Soares, Eduardo A.
    [J]. INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2020, 35 (06) : 923 - 943
  • [25] An Approach to Imbalanced Data Classification Based on Instance Selection and Over-Sampling
    Czarnowski, Ireneusz
    Jedrzejowicz, Piotr
    [J]. COMPUTATIONAL COLLECTIVE INTELLIGENCE, PT I, 2019, 11683 : 601 - 610
  • [26] Enriched Over-Sampling Techniques for Improving Classification of Imbalanced Big Data
    Patil, Sachin Subhash
    Sonavane, Shefali Pratap
    [J]. 2017 THIRD IEEE INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING SERVICE AND APPLICATIONS (IEEE BIGDATASERVICE 2017), 2017, : 1 - 10
  • [27] Clustering boundary over-sampling classification method for imbalanced data sets
    Lou, Xiao-Jun
    Sun, Yu-Xuan
    Liu, Hai-Tao
    [J]. Liu, H.-T. (liuhaitao@wsn.cn), 1600, Zhejiang University (47): : 944 - 950
  • [28] Diversity and Separable Metrics in Over-Sampling Technique for Imbalanced Data Classification
    Mahmoudi, Shadi
    Moradi, Parham
    Akhlaghian, Fardin
    Moradi, Rizan
    [J]. 2014 4TH INTERNATIONAL CONFERENCE ON COMPUTER AND KNOWLEDGE ENGINEERING (ICCKE), 2014, : 152 - 158
  • [29] Abstention-SMOTE: An over-sampling approach for imbalanced data classification
    Zhang, Cheng
    Chen, Yufei
    Liu, Xianhui
    Zhao, Xiaodong
    [J]. PROCEEDINGS OF THE 2017 INTERNATIONAL CONFERENCE ON INFORMATION TECHNOLOGY (ICIT 2017), 2017, : 17 - 21
  • [30] AN IMBALANCED DATA CLASSIFICATION METHOD BASED ON AUTOMATIC CLUSTERING UNDER-SAMPLING
    Deng, Xiaoheng
    Zhong, Weijian
    Ren, Ju
    Zeng, Detian
    Zhang, Honggang
    [J]. 2016 IEEE 35TH INTERNATIONAL PERFORMANCE COMPUTING AND COMMUNICATIONS CONFERENCE (IPCCC), 2016,