Approximating Learning Curves for Imbalanced Big Data with Limited Labels

Cited by: 0
Authors
Richter, Aaron N. [1]
Khoshgoftaar, Taghi M. [1]
Affiliations
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Keywords
learning curve; semi-supervised learning; limited labels; big data; class imbalance
DOI
10.1109/ICTAI.2019.00041
Chinese Library Classification number
TP18 [Artificial Intelligence Theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Labeling data for supervised learning can be an expensive task, especially when large amounts of data are required to build an adequate classifier. For most problems, there exists a point of diminishing returns on a learning curve where adding more data only marginally increases model performance. It would be beneficial to approximate this point for scenarios where there is a large amount of data available but only a small amount of labeled data. Then, time and resources can be spent wisely to label the sample that is required for acceptable model performance. In this study, we explore learning curve approximation methods on a big imbalanced dataset from the bioinformatics domain. We evaluate a curve fitting method developed on small data using an inverse power law model, and propose a new semi-supervised method to take advantage of the large amount of unlabeled data. We find that the traditional curve fitting method is not effective for large sample sizes, while the semi-supervised method more accurately identifies the point of diminishing returns.
Pages: 237-242
Number of pages: 6
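
The curve fitting approach mentioned in the abstract can be illustrated with a minimal sketch. It assumes the common inverse power law parameterization score(n) = a - b * n^(-c), which may differ from the exact form used in the paper. The function names, the grid over c, the synthetic scores, and the gain threshold are all hypothetical choices made for illustration, and the paper's semi-supervised method is not reproduced here.

```python
def fit_inverse_power_law(sizes, scores):
    """Fit score(n) = a - b * n**(-c).

    For a fixed exponent c the model is linear in a and b, so a and b come
    from simple linear regression on z = n**(-c). The exponent is chosen by
    a grid search that minimizes the sum of squared errors.
    """
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for k in range(1, 301):  # candidate exponents 0.01 .. 3.00 (assumed grid)
        c = k / 100.0
        z = [n ** (-c) for n in sizes]
        z_mean = sum(z) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz
        a = y_mean - slope * z_mean
        b = -slope
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def diminishing_returns_point(b, c, min_gain_per_sample):
    """Sample size at which the fitted curve's slope drops to the threshold.

    The slope of a - b * n**(-c) is b * c * n**(-(c + 1)); setting it equal
    to min_gain_per_sample and solving for n gives the value returned here.
    Assumes b > 0 and c > 0 (an increasing, flattening curve).
    """
    return (b * c / min_gain_per_sample) ** (1.0 / (c + 1.0))


# Synthetic example: scores generated from a = 0.9, b = 2.0, c = 0.5.
sizes = [100, 200, 500, 1000, 2000, 5000]
scores = [0.9 - 2.0 * n ** -0.5 for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
n_star = diminishing_returns_point(b, c, min_gain_per_sample=1e-6)
print(a, b, c, n_star)
```

The threshold defines what "marginal" means, so the estimated point moves with it. As the abstract notes, a curve fitted only on small labeled samples may extrapolate poorly to large sample sizes, so an estimate like `n_star` is best treated as a rough guide.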