On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Cited by: 0
Authors
Mussmann, Stephen [1]
Jia, Robin [1]
Liang, Percy [1]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
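The retrieval of uncertain points described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each utterance already has an embedding vector, models the match probability as a logistic function of cosine similarity, and scans all pairs by brute force instead of using an efficient nearest-neighbor search. The function names and the parameters `w`, `b`, and `k` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm

def most_uncertain_pairs(emb_a, emb_b, w, b, k):
    """Return the k index pairs (i, j) whose predicted probability is closest to 0.5.

    Assumption: P(match) = sigmoid(w * cosine(a_i, b_j) + b). This toy scoring
    rule stands in for a trained pairwise model.
    """
    scored = []
    for i, u in enumerate(emb_a):
        for j, v in enumerate(emb_b):
            prob = 1.0 / (1.0 + math.exp(-(w * cosine(u, v) + b)))
            # Distance from 0.5 measures confidence; smaller means more uncertain.
            scored.append((abs(prob - 0.5), i, j))
    scored.sort()
    return [(i, j) for _, i, j in scored[:k]]

# Toy pool: two utterances on each side, 2-dimensional embeddings.
pool_a = [[1.0, 0.0], [0.0, 1.0]]
pool_b = [[1.0, 0.0], [1.0, 1.0]]
selected = most_uncertain_pairs(pool_a, pool_b, w=1.0, b=0.0, k=1)
```

In an active learning loop, the selected pairs would be sent for labeling and the model retrained before the next round of selection.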
Pages: 14