Imbalanced data sampling design based on grid boundary domain for big data

Cited by: 2
Authors
He, Hanji [1]
He, Jianfeng [1]
Zhang, Liwei [2]
Affiliations
[1] South China Univ Technol, Sch Econ & Finance, Guangzhou, Peoples R China
[2] Ping An Insurance Co China, Shenzhen, Peoples R China
Keywords
Mass of grid cell; Mixed-resampling; Boundary domain; Random under-sampling; Subdata selection; SMOTE
DOI
10.1007/s00180-024-01471-8
Chinese Library Classification (CLC)
O21 [Probability theory and mathematical statistics]; C8 [Statistics]
Discipline classification codes
020208; 070103; 0714
Abstract
Data distributions are often associated with a probability that is known a priori, and the events of interest occur with small probability, so large amounts of imbalanced data arise in sociology, economics, engineering, and many other fields. Existing over- and under-sampling methods are widely used in imbalanced data classification, but over-sampling leads to overfitting and under-sampling discards useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The method obtains classification boundary information by identifying the grid boundary domain, and uses it to determine the importance of the samples. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grid cells, and random under-sampling is applied to the other grid cells. This mixed sampling strategy extracts more of the important classification boundary information, especially information for identifying positive samples. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of NGBM and to illustrate its advantages on imbalanced data, as well as its practical applications.
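The mixed sampling idea in the abstract can be illustrated with a rough sketch. This is not the authors' NGBM implementation: the function name `ngbm_sketch`, the parameters `k`, `keep_frac`, and `n_synth`, the rule that a boundary cell is any grid cell containing both classes, and the choice to under-sample only negative samples are all assumptions made for illustration. The interpolation step is a simplified SMOTE-style variant that uses the single nearest positive neighbor in the same cell.

```python
import random
from collections import defaultdict


def cell_of(x, lo, width, k):
    """Map a point to its grid cell (a tuple of bin indices, k bins per feature)."""
    return tuple(min(int((v - l) / w), k - 1) for v, l, w in zip(x, lo, width))


def ngbm_sketch(X, y, k=4, keep_frac=0.5, n_synth=1, seed=0):
    """Hypothetical sketch: SMOTE-style over-sampling in boundary cells,
    random under-sampling of negatives (label 0) in all other cells."""
    rng = random.Random(seed)
    dims = range(len(X[0]))
    lo = [min(x[d] for x in X) for d in dims]
    hi = [max(x[d] for x in X) for d in dims]
    # Fall back to width 1.0 for constant features to avoid division by zero.
    width = [((h - l) / k) or 1.0 for l, h in zip(lo, hi)]

    # Group sample indices by grid cell.
    cells = defaultdict(list)
    for i, x in enumerate(X):
        cells[cell_of(x, lo, width, k)].append(i)

    X_out, y_out = [], []
    for idx in cells.values():
        pos = [i for i in idx if y[i] == 1]
        neg = [i for i in idx if y[i] == 0]
        if pos and neg:
            # Assumed boundary cell: keep every sample and add synthetic positives.
            keep = idx
            for i in pos:
                others = [j for j in pos if j != i]
                if not others:
                    continue
                # Nearest other positive in the same cell (squared Euclidean distance).
                j = min(others, key=lambda m: sum((a - b) ** 2 for a, b in zip(X[i], X[m])))
                for _ in range(n_synth):
                    t = rng.random()
                    X_out.append([a + t * (b - a) for a, b in zip(X[i], X[j])])
                    y_out.append(1)
        else:
            # Non-boundary cell: keep positives, keep a random fraction of negatives.
            keep = pos + rng.sample(neg, int(len(neg) * keep_frac))
        for i in keep:
            X_out.append(list(X[i]))
            y_out.append(y[i])
    return X_out, y_out
```

Calling `ngbm_sketch(X, y)` with `X` as a list of numeric feature lists and `y` as a list of 0/1 labels returns the resampled lists. In this sketch, a larger `k` gives finer cells and a narrower boundary region, and `keep_frac` controls how aggressively the non-boundary negatives are thinned; the paper's own parameter-setting strategy is not reproduced here.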
Pages: 27-64
Page count: 38