Imbalanced data sampling design based on grid boundary domain for big data

Cited by: 2

Authors:

He, Hanji [1]

He, Jianfeng [1]

Zhang, Liwei [2]

Affiliations:

[1] South China Univ Technol, Sch Econ & Finance, Guangzhou, Peoples R China

[2] Ping An Insurance Co China, Shenzhen, Peoples R China

Source:

COMPUTATIONAL STATISTICS | 2025 / Vol. 40 / Issue 01

Keywords:

Mass of grid cell; Mixed-resampling; Boundary domain; Random under-sampling; Subdata selection; SMOTE

DOI:

10.1007/s00180-024-01471-8

Chinese Library Classification (CLC):

O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]

Discipline classification codes:

020208; 070103; 0714

Abstract:

Data distributions are often associated with probabilities known a priori, and events of interest occur with small probability, so large amounts of imbalanced data arise in sociology, economics, engineering, and many other fields. Existing over- and under-sampling methods are widely used in imbalanced data classification, but over-sampling leads to overfitting, and under-sampling discards useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The method obtains classification boundary information through grid boundary domain identification and uses it to determine the importance of each sample. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grid, and random under-sampling is applied to the other grids. This mixed sampling strategy extracts more of the important classification boundary information, especially information that helps identify positive samples. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of NGBM and to illustrate its advantages on imbalanced data and in practical applications.
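
The mixed-sampling idea described in the abstract can be illustrated with a minimal sketch. This is not the published NGBM: the abstract does not give the boundary rule, the neighbor-grid construction, or how the mass of a grid cell is used, so everything specific here is an assumption made for illustration. In particular, the function name `grid_mixed_sample`, the equal-width grid, the rule "a cell is a boundary cell if it contains both classes", and the parameters `bins`, `keep_frac`, `n_synth`, and `k` are hypothetical. Labels are assumed to be 0 (majority) and 1 (minority).

```python
import numpy as np

def grid_mixed_sample(X, y, bins=8, keep_frac=0.3, n_synth=2, k=5, seed=0):
    """Illustrative grid-based mixed sampling (a sketch, not the published NGBM)."""
    rng = np.random.default_rng(seed)

    # Assign each sample to a grid cell using equal-width bins per feature.
    lo, hi = X.min(axis=0), X.max(axis=0)
    width = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum(((X - lo) / width * bins).astype(int), bins - 1)
    keys = [tuple(c) for c in cells]

    # Assumption: a cell is a "boundary" cell when it holds both classes.
    classes_in_cell = {}
    for key, label in zip(keys, y):
        classes_in_cell.setdefault(key, set()).add(int(label))
    in_boundary = np.array([len(classes_in_cell[key]) == 2 for key in keys])

    # Majority samples outside boundary cells: random under-sampling.
    maj_other = np.flatnonzero((y == 0) & ~in_boundary)
    n_keep = int(round(keep_frac * maj_other.size))
    kept = rng.permutation(maj_other)[:n_keep]

    # All boundary-cell samples and all minority samples are kept as they are.
    keep_idx = np.concatenate([np.flatnonzero(in_boundary | (y == 1)), kept])

    # Minority samples inside boundary cells: SMOTE-style interpolation
    # toward one of their k nearest minority neighbours.
    min_all = np.flatnonzero(y == 1)
    min_boundary = np.flatnonzero((y == 1) & in_boundary)
    synth = []
    if min_all.size > 1:
        for i in min_boundary:
            others = min_all[min_all != i]
            dist = np.linalg.norm(X[others] - X[i], axis=1)
            neighbours = others[np.argsort(dist)[:k]]
            for _ in range(n_synth):
                j = rng.choice(neighbours)
                synth.append(X[i] + rng.random() * (X[j] - X[i]))

    parts = [X[keep_idx]] + ([np.array(synth)] if synth else [])
    X_new = np.vstack(parts)
    y_new = np.concatenate([y[keep_idx], np.ones(len(synth), dtype=y.dtype)])
    return X_new, y_new

# Toy data: 2000 majority points and 60 overlapping minority points.
data_rng = np.random.default_rng(42)
X = np.vstack([data_rng.normal(0.0, 1.0, size=(2000, 2)),
               data_rng.normal(2.0, 0.5, size=(60, 2))])
y = np.array([0] * 2000 + [1] * 60)

X_s, y_s = grid_mixed_sample(X, y)
print("class counts before:", np.bincount(y))
print("class counts after: ", np.bincount(y_s))
```

In this sketch the class ratio after resampling is governed mainly by `keep_frac` and `n_synth`, while `bins` decides how finely the boundary region is resolved. The paper discusses its own parameter-setting strategy, which this example does not reproduce.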


Pages: 27-64

Page count: 38

