Imbalanced data sampling design based on grid boundary domain for big data

Cited by: 2
Authors
He, Hanji [1]
He, Jianfeng [1]
Zhang, Liwei [2]
Affiliations
[1] South China Univ Technol, Sch Econ & Finance, Guangzhou, Peoples R China
[2] Ping An Insurance Co China, Shenzhen, Peoples R China
Keywords
Mass of grid cell; Mixed-resampling; Boundary domain; Random under-sampling; Subdata selection; SMOTE
DOI
10.1007/s00180-024-01471-8
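Chinese Library Classification (CLC) number
O21 [Probability Theory and Mathematical Statistics]; C8 [Statistics]
Subject classification codes
020208; 070103; 0714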
Abstract
The data distribution is often associated with an a priori known probability, and the probability that an event of interest occurs is small, so large amounts of imbalanced data arise in sociology, economics, engineering, and various other fields. Existing over- and under-sampling methods are widely used in imbalanced data classification problems, but over-sampling leads to overfitting, and under-sampling discards useful information. We propose a new sampling design algorithm, the neighbor grid of boundary mixed-sampling (NGBM), which focuses on boundary information. The method obtains classification boundary information through grid boundary domain identification and uses it to determine the importance of the samples. On this basis, the synthetic minority oversampling technique (SMOTE) is applied to the boundary grids, and random under-sampling is applied to the other grids. This mixed sampling strategy extracts more of the important classification boundary information, especially the information needed to identify positive samples. Numerical simulations and real data analysis are used to discuss the parameter-setting strategy of the NGBM and to illustrate its advantages on imbalanced data, as well as its practical applications.
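The mixed sampling idea in the abstract can be illustrated with a minimal sketch. This is not the authors' NGBM implementation: the abstract does not specify how the grid is built or how boundary cells are identified, so the sketch assumes a uniform grid, treats a cell as a boundary cell when it contains both classes, and uses a simplified brute-force SMOTE-style interpolation. The function name `grid_mixed_sample` and the parameters `bins`, `keep_frac`, `k`, and `n_synth` are hypothetical.

```python
import numpy as np

def grid_mixed_sample(X, y, bins=4, keep_frac=0.5, k=5, n_synth=1, seed=0):
    """Illustrative grid-based mixed sampling; y is 1 (minority) or 0 (majority)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)

    # Assumption: a uniform grid with `bins` intervals per feature.
    lo, hi = X.min(axis=0), X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = np.minimum((bins * (X - lo) / span).astype(int), bins - 1)
    keys = [tuple(c) for c in cells]

    # Assumption: a cell is a "boundary" cell if it contains both classes.
    labels_in_cell = {}
    for key, label in zip(keys, y):
        labels_in_cell.setdefault(key, set()).add(int(label))
    boundary = np.array([len(labels_in_cell[key]) == 2 for key in keys])

    # Keep every boundary point and every minority point.
    keep = boundary | (y == 1)

    # Random under-sampling of majority points outside boundary cells.
    maj_out = np.flatnonzero(~boundary & (y == 0))
    n_keep = int(round(keep_frac * maj_out.size))
    if n_keep > 0:
        keep[rng.choice(maj_out, size=n_keep, replace=False)] = True

    # Simplified SMOTE on boundary minority points: interpolate toward one of
    # the k nearest boundary minority neighbours (brute force, O(m^2) memory).
    min_b = X[boundary & (y == 1)]
    synth = []
    if len(min_b) > 1:
        dist = np.linalg.norm(min_b[:, None, :] - min_b[None, :, :], axis=2)
        np.fill_diagonal(dist, np.inf)
        kk = min(k, len(min_b) - 1)
        nearest = np.argsort(dist, axis=1)[:, :kk]
        for i in range(len(min_b)):
            for _ in range(n_synth):
                j = nearest[i, rng.integers(kk)]
                gap = rng.random()
                synth.append(min_b[i] + gap * (min_b[j] - min_b[i]))

    X_out, y_out = X[keep], y[keep]
    if synth:
        X_out = np.vstack([X_out, np.array(synth)])
        y_out = np.concatenate([y_out, np.ones(len(synth), dtype=y.dtype)])
    return X_out, y_out

# Usage on synthetic data: 500 majority points and 25 minority points.
demo_rng = np.random.default_rng(1)
X = np.vstack([demo_rng.normal(0.0, 1.0, size=(500, 2)),
               demo_rng.normal(2.0, 0.5, size=(25, 2))])
y = np.concatenate([np.zeros(500, dtype=int), np.ones(25, dtype=int)])
X_res, y_res = grid_mixed_sample(X, y)
print("minority share before:", (y == 1).mean(), "after:", (y_res == 1).mean())
```

The pairwise distance matrix keeps the sketch short but would not scale to large data sets; a tree-based or per-cell neighbour search would be the natural replacement there.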
Pages: 27-64
Page count: 38