On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Authors
Mussmann, Stephen [1]
Jia, Robin [1]
Liang, Percy [1]
Affiliation
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
Pages: 14