A survey on addressing high-class imbalance in big data

Cited by: 426
Authors
Leevy J.L. [1 ]
Khoshgoftaar T.M. [1 ]
Bauder R.A. [1 ]
Seliya N. [2 ]
Affiliations
[1] Florida Atlantic University, Boca Raton
[2] Ohio Northern University, Ada
Funding
U.S. National Science Foundation
Keywords
Big data; Cost-sensitive learners; Data sampling; High-class imbalance
DOI
10.1186/s40537-018-0151-6
Abstract
In a majority–minority classification problem, class imbalance in the dataset(s) can dramatically skew the performance of classifiers, introducing a prediction bias toward the majority class. If the positive (minority) class is the group of interest and the application domain dictates that a false negative is much costlier than a false positive, a prediction bias toward the negative (majority) class can have adverse consequences. With big data, mitigating class imbalance is an even greater challenge because of the varied and complex structure of the much larger datasets. This paper surveys studies published within the last 8 years, focusing on high-class imbalance (i.e., a majority-to-minority class ratio between 100:1 and 10,000:1) in big data, in order to assess the state of the art in addressing the adverse effects of class imbalance. Two categories of techniques are covered: Data-Level methods (e.g., data sampling) and Algorithm-Level methods (e.g., cost-sensitive and hybrid/ensemble approaches). Data sampling methods are popular for addressing class imbalance, with Random Over-Sampling methods generally showing better overall results. At the Algorithm-Level, there are some outstanding performers. Yet the published studies report inconsistent and conflicting results and evaluate a limited range of techniques, indicating the need for more comprehensive, comparative studies. © 2018, The Author(s).
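For readers unfamiliar with the Data-Level approach named in the abstract, Random Over-Sampling can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation used in any surveyed study: the function name `random_oversample`, the toy dataset, and the choice to balance the classes to exactly 1:1 are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(rows, labels, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until both classes are equal in size.

    Hypothetical helper for illustration; balancing to 1:1 is an assumption,
    and other target ratios are equally possible.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == minority_label]
    majority = [i for i, y in enumerate(labels) if y != minority_label]
    if not minority:
        raise ValueError("no minority examples to sample from")
    # Number of extra minority copies needed to match the majority count.
    deficit = max(0, len(majority) - len(minority))
    # Sample with replacement from the existing minority rows.
    extra = [rng.choice(minority) for _ in range(deficit)]
    keep = list(range(len(labels))) + extra
    return [rows[i] for i in keep], [labels[i] for i in keep]


# Toy data at a 100:1 majority-to-minority ratio (1000 negatives, 10 positives).
rows = [[float(i)] for i in range(1010)]
labels = [1 if i < 10 else 0 for i in range(1010)]

new_rows, new_labels = random_oversample(rows, labels, minority_label=1)
counts = Counter(new_labels)
print(counts[0], counts[1])
```

Over-sampling of this kind is normally applied to the training split only, so that duplicated minority rows do not leak into the evaluation data.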