Imbalanced data sampling design based on grid boundary domain for big data

Cited by: 2
Authors
He, Hanji [1]
He, Jianfeng [1]
Zhang, Liwei [2]
Affiliations
[1] South China Univ Technol, Sch Econ & Finance, Guangzhou, Peoples R China
[2] Ping An Insurance Co China, Shenzhen, Peoples R China
Keywords
Mass of grid cell; Mixed-resampling; Boundary domain; Random under-sampling; Subdata selection; SMOTE
DOI
10.1007/s00180-024-01471-8
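Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Discipline classification code
020208; 070103; 0714
Abstract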
Data distributions are often associated with a priori known probabilities, and events of interest tend to occur with small probability, so large amounts of imbalanced data arise in sociology, economics, engineering, and many other fields. Existing over- and under-sampling methods are widely used in imbalanced data classification, but over-sampling leads to overfitting, and under-sampling ignores useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The method obtains classification boundary information through grid boundary domain identification and uses it to determine the importance of the samples. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grid, and random under-sampling is applied to the other grids. This mixed sampling strategy extracts more of the important classification boundary information, especially the information needed to identify positive samples. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of NGBM and to illustrate its advantages on imbalanced data, as well as its practical applications.
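The mixed strategy described in the abstract (over-sample inside boundary grid cells, randomly under-sample elsewhere) can be illustrated with a minimal sketch. This is not the authors' NGBM implementation: the abstract does not define the boundary domain, the mass of a grid cell, or the neighbor-grid rule, so the sketch assumes that a "boundary cell" is simply a cell of a regular grid containing both classes. It also replaces SMOTE's k-nearest-neighbor step with interpolation between random pairs of positives in the same cell. The names `grid_cell_ids`, `mixed_resample`, `n_bins`, `keep_frac`, and `n_synth` are hypothetical.

```python
import numpy as np


def grid_cell_ids(X, n_bins):
    """Return one integer cell id per row of X on a regular n_bins^d grid."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid division by zero
    idx = np.floor((X - lo) / span * n_bins).astype(int)
    idx = np.clip(idx, 0, n_bins - 1)        # the maximum falls in the last bin
    return np.ravel_multi_index(tuple(idx.T), (n_bins,) * X.shape[1])


def mixed_resample(X, y, n_bins=10, keep_frac=0.2, n_synth=2, seed=0):
    """Over-sample positives in mixed-class cells, under-sample negatives elsewhere.

    Assumption (not from the paper): a cell is a "boundary" cell when it
    contains both positive (1) and negative (0) samples.
    """
    rng = np.random.default_rng(seed)
    ids = grid_cell_ids(X, n_bins)
    X_out, y_out = [], []
    for c in np.unique(ids):
        members = np.flatnonzero(ids == c)
        pos = members[y[members] == 1]
        neg = members[y[members] == 0]
        if len(pos) > 0 and len(neg) > 0:
            # Boundary cell: keep every original sample.
            X_out.append(X[members])
            y_out.append(y[members])
            if len(pos) >= 2:
                # Simplified SMOTE-style step: interpolate between random
                # pairs of positives that share this cell.
                m = n_synth * len(pos)
                a = rng.choice(pos, size=m)
                b = rng.choice(pos, size=m)
                lam = rng.random((m, 1))
                X_out.append(X[a] + lam * (X[b] - X[a]))
                y_out.append(np.ones(m, dtype=y.dtype))
        else:
            # Non-boundary cell: keep all positives, subsample negatives.
            X_out.append(X[pos])
            y_out.append(y[pos])
            if len(neg) > 0:
                k = int(np.ceil(keep_frac * len(neg)))
                keep = rng.choice(neg, size=k, replace=False)
                X_out.append(X[keep])
                y_out.append(y[keep])
    return np.vstack(X_out), np.concatenate(y_out)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    X = np.vstack([rng.normal(0.0, 1.0, (2000, 2)),
                   rng.normal(1.5, 0.5, (60, 2))])
    y = np.concatenate([np.zeros(2000, dtype=int), np.ones(60, dtype=int)])
    Xr, yr = mixed_resample(X, y)
    print("before:", np.bincount(y), "after:", np.bincount(yr))
```

Because each synthetic point is a convex combination of two positives from the same cell, it stays inside that cell. The flat cell id grows as `n_bins ** d`, so this form only suits low-dimensional data.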
Pages: 27-64
Page count: 38
Related papers
50 in total
  • [21] SVM Learning from Imbalanced Data by GA Sampling for Protein Domain Prediction
    Zou, Shuxue
    Huang, Yanxin
    Wang, Yan
    Wang, Jianxin
    Zhou, Chunguang
    PROCEEDINGS OF THE 9TH INTERNATIONAL CONFERENCE FOR YOUNG COMPUTER SCIENTISTS, VOLS 1-5, 2008, : 982 - +
  • [22] A Mixed Sampling Method for Imbalanced Data Based on Neighborhood Density
    Hu, Feng
    Yu, Chunlin
    Dai, Jin
    Liu, Ke
    2019 IEEE 4TH INTERNATIONAL CONFERENCE ON CLOUD COMPUTING AND BIG DATA ANALYSIS (ICCCBDA), 2019, : 94 - 98
  • [23] Binary imbalanced big data classification based on fuzzy data reduction and classifier fusion
    Zhai, Junhai
    Wang, Mohan
    Zhang, Sufang
    SOFT COMPUTING, 2022, 26 (06) : 2781 - 2792
  • [24] Cluster-based sampling approaches to imbalanced data distributions
    Yen, Show-Jane
    Lee, Yue-Shi
    DATA WAREHOUSING AND KNOWLEDGE DISCOVERY, PROCEEDINGS, 2006, 4081 : 427 - 436
  • [25] Binary imbalanced big data classification based on fuzzy data reduction and classifier fusion
    Junhai Zhai
    Mohan Wang
    Sufang Zhang
    Soft Computing, 2022, 26 : 2781 - 2792
  • [26] A dynamic ensemble learning based data mining framework for medical imbalanced big data
    Rithani, M.
    Kumar, R. Prasanna
    Ali, Altalbe
    KNOWLEDGE-BASED SYSTEMS, 2025, 310
  • [27] Ensemble of Classifiers Based on Multiobjective Genetic Sampling for Imbalanced Data
    Fernandes, Everlandio R. Q.
    de Carvalho, Andre C. P. L. F.
    Yao, Xin
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2020, 32 (06) : 1104 - 1115
  • [28] Safe sample screening based sampling method for imbalanced data
    Shi H.
    Liu Y.
    Ji S.
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2019, 32 (06) : 545 - 556
  • [29] A design of information granule-based under-sampling method in imbalanced data classification
    Tianyu Liu
    Xiubin Zhu
    Witold Pedrycz
    Zhiwu Li
    Soft Computing, 2020, 24 : 17333 - 17347
  • [30] A design of information granule-based under-sampling method in imbalanced data classification
    Liu, Tianyu
    Zhu, Xiubin
    Pedrycz, Witold
    Li, Zhiwu
    SOFT COMPUTING, 2020, 24 (22) : 17333 - 17347