On the Importance of Adaptive Data Collection for Extremely Imbalanced Pairwise Tasks

Cited by: 0
Authors
Mussmann, Stephen [1]
Jia, Robin [1]
Liang, Percy [1]
Affiliations
[1] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Many pairwise classification tasks, such as paraphrase detection and open-domain question answering, naturally have extreme label imbalance (e.g., 99.99% of examples are negatives). In contrast, many recent datasets heuristically choose examples to ensure label balance. We show that these heuristics lead to trained models that generalize poorly: state-of-the-art models trained on QQP and WikiQA each have only 2.4% average precision when evaluated on realistically imbalanced test data. We instead collect training data with active learning, using a BERT-based embedding model to efficiently retrieve uncertain points from a very large pool of unlabeled utterance pairs. By creating balanced training data with more informative negative examples, active learning greatly improves average precision to 32.5% on QQP and 20.1% on WikiQA.
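The two quantities the abstract relies on, average precision and the selection of uncertain points, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `average_precision` and `most_uncertain` are hypothetical, the scores are made-up toy values, and the brute-force scan over the pool stands in for the embedding-based retrieval the abstract mentions, which this record does not describe in detail.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of precision@k taken at the rank k of each positive example."""
    # Rank examples from highest to lowest score.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def most_uncertain(probs: Sequence[float], k: int) -> List[int]:
    """Indices of the k pool items whose predicted probability is closest to 0.5.

    Assumption: a plain scan over all pool items; a very large pool would
    need an approximate retrieval step instead.
    """
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:k]


if __name__ == "__main__":
    # Toy data (hypothetical): one positive among many negatives.
    labels = [0, 0, 1] + [0] * 97
    scores = [0.9, 0.8, 0.7] + [0.1] * 97
    print(average_precision(labels, scores))
    print(most_uncertain([0.01, 0.48, 0.95, 0.6, 0.2], k=2))
```

In the toy data a single positive ranked third yields an average precision of one third, which shows how a few high-scoring negatives dominate the metric when positives are rare.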
Pages: 14
Related papers
50 items in total
  • [21] A new data complexity measure for multi-class imbalanced classification tasks
    Han, Mingming
    Guo, Husheng
    Wang, Wenjian
    PATTERN RECOGNITION, 2025, 157
  • [22] ReMAHA-CatBoost: Addressing Imbalanced Data in Traffic Accident Prediction Tasks
    Li, Guolian
    Wu, Yadong
    Bai, Yulong
    Zhang, Weihan
    APPLIED SCIENCES-BASEL, 2023, 13 (24)
  • [23] Adaptive Methods for Classification in Arbitrarily Imbalanced and Drifting Data Streams
    Lichtenwalter, Ryan N.
    Chawla, Nitesh V.
    NEW FRONTIERS IN APPLIED DATA MINING, 2010, 5669: 53-75
  • [24] ADAPTIVE DATA REUSE FOR CLASSIFYING IMBALANCED AND CONCEPT-DRIFTING DATA STREAMS
    Nguyen, Hien M.
    Cooper, Eric W.
    Kamei, Katsuari
    INTERNATIONAL JOURNAL OF INNOVATIVE COMPUTING INFORMATION AND CONTROL, 2012, 8 (7B): 4995-5010
  • [25] Online Adaptive Asymmetric Active Learning for Budgeted Imbalanced Data
    Zhang, Yifan
    Zhao, Peilin
    Cao, Jiezhang
    Ma, Wenye
    Huang, Junzhou
    Wu, Qingyao
    Tan, Mingkui
    KDD'18: PROCEEDINGS OF THE 24TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2018: 2768-2777
  • [26] Adaptive Learning in Imbalanced Data Streams With Unpredictable Feature Evolution
    Tu, Jiahang
    Tang, Xijia
    Gu, Shilin
    Dai, Yucong
    Fan, Ruidong
    Hou, Chenping
    IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, 2025, 37 (04): 1527-1541
  • [27] Overview and Importance of Data Quality for Machine Learning Tasks
    Jain, Abhinav
    Patel, Hima
    Nagalapatti, Lokesh
    Gupta, Nitin
    Mehta, Sameep
    Guttula, Shanmukha
    Mujumdar, Shashank
    Afzal, Shazia
    Mittal, Ruhi Sharma
    Munigala, Vitobha
    KDD '20: PROCEEDINGS OF THE 26TH ACM SIGKDD INTERNATIONAL CONFERENCE ON KNOWLEDGE DISCOVERY & DATA MINING, 2020: 3561-3562
  • [28] K-Segments Under Bagging approach: An experimental Study on Extremely Imbalanced Data Classification
    Tuan Tran
    Loc Tran
    An Mai
    ISCIT 2019: PROCEEDINGS OF 2019 19TH INTERNATIONAL SYMPOSIUM ON COMMUNICATIONS AND INFORMATION TECHNOLOGIES (ISCIT), 2019: 492-495
  • [29] Adaptive Polling for Underwater Data Collection Systems
    Liu, Wei
    Weaver, Jeffrey
    Whelan, Terry
    Bagrodia, Rajive
    Forero, Pedro A.
    Chavez, Jose
    Capella, Matthew
    OCEANS 2018 MTS/IEEE CHARLESTON, 2018
  • [30] An Adaptive Network Data Collection System in SDN
    Zhou, Donghao
    Yan, Zheng
    Liu, Gao
    Atiquzzaman, Mohammed
    IEEE TRANSACTIONS ON COGNITIVE COMMUNICATIONS AND NETWORKING, 2020, 6 (02): 562-574