A K-means Improved CTGAN Oversampling Method for Data Imbalance Problem

Cited by: 7

Authors:

An, Chunsheng [1]

Sun, Jingtong [2]

Wang, Yifeng [1]

Wei, Qingjie [1]

Affiliations:

[1] Chongqing Engn Res Ctr Software Qual Assurance Te, CQUPT, Chongqing, Peoples R China

[2] Seaquam Secondary, Delta, BC, Canada

Source:

2021 IEEE 21ST INTERNATIONAL CONFERENCE ON SOFTWARE QUALITY, RELIABILITY AND SECURITY (QRS 2021) | 2021

Keywords:

K-means; CTGAN; oversampling; data imbalance; SMOTE;

DOI:

10.1109/QRS54544.2021.00097

Chinese Library Classification (CLC) number:

TP31 [Computer software];

Discipline classification codes:

081202 ; 0835 ;

Abstract:

CTGAN is a tabular data synthesis method designed for privacy preservation; this paper applies it to the data imbalance problem. The paper proposes a method for handling imbalanced datasets that combines K-means clustering with CTGAN, addressing the uneven distribution of minority-class examples that results from oversampling with CTGAN. Experiments with the LightGBM algorithm on home loan and online shopping datasets show that the CTGAN method achieves better F1-score and G-mean results than the interpolation-based oversampling techniques represented by SMOTE. These results indicate that applying the proposed method to an imbalanced dataset yields a dataset with more examples, a more uniform distribution, and less overfitting, while still following the original dataset's probability distribution.
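The abstract evaluates with F1-score and G-mean, both computed from the confusion matrix of a binary classifier. The following is a minimal sketch of those two metrics using their standard definitions (G-mean as the geometric mean of sensitivity and specificity); the function names and the toy labels are illustrative assumptions and are not taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, FP, TN, FN, treating `positive` as the minority class."""
    tp = fp = tn = fn = 0
    for t, p in zip(y_true, y_pred):
        if t == positive and p == positive:
            tp += 1
        elif t != positive and p == positive:
            fp += 1
        elif t == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def f1_score(tp, fp, fn):
    """F1 = 2*TP / (2*TP + FP + FN); defined as 0.0 when the denominator is 0."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def g_mean(tp, fp, tn, fn):
    """Geometric mean of sensitivity (TPR) and specificity (TNR)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical toy labels: 2 minority (1) and 8 majority (0) examples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
f1 = f1_score(tp, fp, fn)
gm = g_mean(tp, fp, tn, fn)

# A classifier that always predicts the majority class scores 0 on both metrics,
# even though its accuracy on these labels would be 0.8.
always_majority = [0] * len(y_true)
gm_majority = g_mean(*confusion_counts(y_true, always_majority))
```

Both metrics penalize a model that ignores the minority class, which is why they are commonly preferred over plain accuracy when comparing oversampling methods.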

Pages: 883-887

Page count: 5
