A K-means Improved CTGAN Oversampling Method for Data Imbalance Problem

Cited by: 7
Authors
An, Chunsheng [1]
Sun, Jingtong [2]
Wang, Yifeng [1]
Wei, Qingjie [1]
Affiliations
[1] Chongqing Engn Res Ctr Software Qual Assurance Te, CQUPT, Chongqing, Peoples R China
[2] Seaquam Secondary, Delta, BC, Canada
Keywords
K-means; CTGAN; oversampling; data imbalance; SMOTE
DOI
10.1109/QRS54544.2021.00097
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
CTGAN is a tabular data synthesis method designed for privacy preservation; this paper applies it to the data imbalance problem. The paper proposes a method for handling imbalanced datasets that combines K-means clustering with CTGAN, addressing the uneven distribution of minority-class examples that results from oversampling with CTGAN. Experiments with the LightGBM algorithm on home loan and online shopping datasets show that the CTGAN method achieves better F1-score and G-mean results than the interpolation-based oversampling techniques represented by SMOTE. These results indicate that applying the proposed method to an imbalanced dataset yields a dataset with more examples, a more uniform distribution, and less overfitting, while still following the original dataset's probability distribution.
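The abstract reports results in terms of F1-score and G-mean. As a rough illustration of what these two metrics measure on an imbalanced binary problem, the sketch below computes them from scratch using the standard definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity). The function names and the toy labels are hypothetical and are not taken from the paper; the paper's own evaluation code and data are not shown in this record.

```python
import math


def binary_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, tn, fn) for a binary problem; `positive` marks the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t == positive:
            fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fp, tn, fn


def f1_score(y_true, y_pred, positive=1):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is zero."""
    tp, fp, _tn, fn = binary_counts(y_true, y_pred, positive)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(y_true, y_pred, positive=1):
    """G-mean = sqrt(sensitivity * specificity) for the binary case."""
    tp, fp, tn, fn = binary_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


# Toy imbalanced example (made-up labels): 2 minority and 8 majority samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
f1 = f1_score(y_true, y_pred)
gm = g_mean(y_true, y_pred)
print(f"F1 = {f1:.4f}, G-mean = {gm:.4f}")
```

Both metrics depend on minority-class recall, which is why they are commonly preferred over plain accuracy when comparing oversampling methods: a classifier that predicts only the majority class scores 0 on each.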
Pages: 883-887
Page count: 5
Related papers
50 in total
  • [1] An Improved Oversampling Method for imbalanced Data-SMOTE Based on Canopy and K-means
    Guo, Chaoyou
    Ma, Yankun
    Xu, Zhe
    Cao, Mengmeng
    Yao, Qian
    [J]. 2019 CHINESE AUTOMATION CONGRESS (CAC2019), 2019, : 1467 - 1469
  • [2] LR-SMOTE - An improved unbalanced data set oversampling based on K-means and SVM
    Liang, X. W.
    Jiang, A. P.
    Li, T.
    Xue, Y. Y.
    Wang, G. T.
    [J]. KNOWLEDGE-BASED SYSTEMS, 2020, 196
  • [3] Oversampling Method Based on Gaussian Distribution and K-Means Clustering
    Hassan, Masoud Muhammed
    Eesa, Adel Sabry
    Mohammed, Ahmed Jameel
    Arabo, Wahab Kh
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2021, 69 (01): : 451 - 469
  • [4] An Improved K-means Clustering Method based on Data Field
    Xu, Cui
    Liu, Yuhua
    Xu, Ke
    [J]. INTERNATIONAL CONFERENCE ON CONTROL SYSTEM AND AUTOMATION (CSA 2013), 2013, : 454 - 459
  • [5] An Improved Method for K-Means Clustering
    Cui, Xiaowei
    Wang, Fuxiang
    [J]. 2015 INTERNATIONAL CONFERENCE ON COMPUTATIONAL INTELLIGENCE AND COMMUNICATION NETWORKS (CICN), 2015, : 756 - 759
  • [6] Improved PTAS for the constrained k-means problem
    Qilong Feng
    Jiaxin Hu
    Neng Huang
    Jianxin Wang
    [J]. Journal of Combinatorial Optimization, 2019, 37 : 1091 - 1110
  • [7] Improved PTAS for the constrained k-means problem
    Feng, Qilong
    Hu, Jiaxin
    Huang, Neng
    Wang, Jianxin
    [J]. JOURNAL OF COMBINATORIAL OPTIMIZATION, 2019, 37 (04) : 1091 - 1110
  • [8] An improved K-means algorithm for big data
    Moodi, Fatemeh
    Saadatfar, Hamid
    [J]. IET SOFTWARE, 2022, 16 (01) : 48 - 59
  • [9] Improved Smoothed Analysis of the k-Means Method
    Manthey, Bodo
    Roeglin, Heiko
    [J]. PROCEEDINGS OF THE TWENTIETH ANNUAL ACM-SIAM SYMPOSIUM ON DISCRETE ALGORITHMS, 2009, : 461 - +
  • [10] Application of an improved K-Means algorithm in data mining
    Wang, JM
    Guo, H
    [J]. PROCEEDINGS OF THE 11TH INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND ENGINEERING MANAGEMENT, VOLS 1 AND 2: INDUSTRIAL ENGINEERING AND ENGINEERING MANAGEMENT IN THE GLOBAL ECONOMY, 2005, : 416 - 419