Approximating Learning Curves for Imbalanced Big Data with Limited Labels

Cited by: 0
Authors
Richter, Aaron N. [1 ]
Khoshgoftaar, Taghi M. [1 ]
Affiliations
[1] Florida Atlantic Univ, Boca Raton, FL 33431 USA
Keywords
learning curve; semi-supervised learning; limited labels; big data; class imbalance
DOI
10.1109/ICTAI.2019.00041
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Labeling data for supervised learning can be an expensive task, especially when large amounts of data are required to build an adequate classifier. For most problems, there exists a point of diminishing returns on a learning curve where adding more data only marginally increases model performance. It would be beneficial to approximate this point for scenarios where there is a large amount of data available but only a small amount of labeled data. Then, time and resources can be spent wisely to label the sample that is required for acceptable model performance. In this study, we explore learning curve approximation methods on a big imbalanced dataset from the bioinformatics domain. We evaluate a curve fitting method developed on small data using an inverse power law model, and propose a new semi-supervised method to take advantage of the large amount of unlabeled data. We find that the traditional curve fitting method is not effective for large sample sizes, while the semi-supervised method more accurately identifies the point of diminishing returns.
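The abstract names an inverse power law model but does not give its parameterization or the fitting procedure. The sketch below assumes the common three-parameter form score(n) = a - b * n^(-c), fits it by a grid search over c with closed-form least squares for a and b, and then estimates the sample size at which the curve comes within a chosen tolerance of its asymptote. The function names, the grid, the tolerance, and the data points are all hypothetical and synthetic; this is an illustration of the general curve fitting idea, not the authors' method or results.

```python
import math


def fit_inverse_power_law(sizes, scores, c_grid=None):
    """Fit score(n) = a - b * n**(-c) (assumed form).

    For each candidate c the model is linear in a and b, so those two
    are solved by ordinary least squares; the c with the lowest squared
    error wins.
    """
    if c_grid is None:
        c_grid = [i / 100 for i in range(1, 201)]  # c in 0.01 .. 2.00
    m = len(sizes)
    y_mean = sum(scores) / m
    best = None
    for c in c_grid:
        z = [n ** (-c) for n in sizes]
        z_mean = sum(z) / m
        szz = sum((zi - z_mean) ** 2 for zi in z)
        if szz == 0:
            continue  # degenerate regressor, skip this c
        szy = sum((zi - z_mean) * (yi - y_mean) for zi, yi in zip(z, scores))
        slope = szy / szz
        a = y_mean - slope * z_mean  # asymptotic performance
        b = -slope                   # scale of the decaying term
        sse = sum((a - b * zi - yi) ** 2 for zi, yi in zip(z, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    if best is None:
        raise ValueError("need at least two distinct sample sizes")
    return best[1], best[2], best[3]


def size_within_tolerance(a, b, c, tol):
    """Smallest n whose predicted score is within tol of the asymptote a,
    i.e. the smallest n with b * n**(-c) <= tol."""
    if b <= 0:
        return 1
    return max(1, math.ceil((b / tol) ** (1.0 / c)))


# Synthetic learning-curve points generated from a known curve
# (a=0.9, b=0.5, c=0.5); these are NOT measurements from the paper.
demo_sizes = [100, 200, 500, 1000, 2000, 5000]
demo_scores = [0.9 - 0.5 * n ** (-0.5) for n in demo_sizes]

a_hat, b_hat, c_hat = fit_inverse_power_law(demo_sizes, demo_scores)
n_needed = size_within_tolerance(a_hat, b_hat, c_hat, tol=0.01)
print(a_hat, b_hat, c_hat, n_needed)
```

Because the fit extrapolates from the small labeled samples alone, it can be sensitive to noise in those few points, which is in line with the abstract's finding that this kind of fit is not effective for large sample sizes.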
Pages: 237-242
Number of pages: 6
Related papers
50 in total
  • [1] Learning evolving prototypes for imbalanced data stream classification with limited labels
    Wu, Zhonglin
    Wang, Hongliang
    Guo, Jingxia
    Yang, Qinli
    Shao, Junming
    [J]. INFORMATION SCIENCES, 2024, 679
  • [2] Deep Learning and Data Sampling with Imbalanced Big Data
    Johnson, Justin M.
    Khoshgoftaar, Taghi M.
    [J]. 2019 IEEE 20TH INTERNATIONAL CONFERENCE ON INFORMATION REUSE AND INTEGRATION FOR DATA SCIENCE (IRI 2019), 2019 : 175 - 183
  • [3] Sample size determination for biomedical big data with limited labels
    Richter, Aaron N.
    Khoshgoftaar, Taghi M.
    [J]. NETWORK MODELING AND ANALYSIS IN HEALTH INFORMATICS AND BIOINFORMATICS, 2020, 9 (01)
  • [4] Sample size determination for biomedical big data with limited labels
    Aaron N. Richter
    Taghi M. Khoshgoftaar
    [J]. Network Modeling Analysis in Health Informatics and Bioinformatics, 2020, 9
  • [5] Sentiment analysis on big sparse data streams with limited labels
    Vasileios Iosifidis
    Eirini Ntoutsi
    [J]. Knowledge and Information Systems, 2020, 62 : 1393 - 1432
  • [6] Sentiment analysis on big sparse data streams with limited labels
    Iosifidis, Vasileios
    Ntoutsi, Eirini
    [J]. Knowledge and Information Systems, 2020, 62 (04) : 1393 - 1432
  • [7] Sentiment analysis on big sparse data streams with limited labels
    Iosifidis, Vasileios
    Ntoutsi, Eirini
    [J]. KNOWLEDGE AND INFORMATION SYSTEMS, 2020, 62 (04) : 1393 - 1432
  • [8] An Effective Integrated Method for Learning Big Imbalanced Data
    Ghanavati, Mojgan
    Wong, Raymond K.
    Chen, Fang
    Wang, Yang
    Perng, Chang-Shing
    [J]. 2014 IEEE INTERNATIONAL CONGRESS ON BIG DATA (BIGDATA CONGRESS), 2014 : 691 - 698
  • [9] Distributed and Weighted Extreme Learning Machine for Imbalanced Big Data Learning
    Zhiqiong Wang
    Junchang Xin
    Hongxu Yang
    Shuo Tian
    Ge Yu
    Chenren Xu
    Yudong Yao
    [J]. Tsinghua Science and Technology, 2017, 22 (02) : 160 - 173
  • [10] The Effects of Data Sampling with Deep Learning and Highly Imbalanced Big Data
    Justin M. Johnson
    Taghi M. Khoshgoftaar
    [J]. Information Systems Frontiers, 2020, 22 : 1113 - 1131