Data Split Strategies for Evolving Predictive Models

Cited by: 7
Authors
Raykar, Vikas C. [1 ]
Saha, Amrita [1 ]
Affiliations
[1] IBM Res, Bangalore, Karnataka, India
Keywords
Data splits; Model assessment; Predictive models
DOI
10.1007/978-3-319-23528-8_1
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
A conventional textbook prescription for building good predictive models is to split the data into three parts: a training set (for model fitting), a validation set (for model selection), and a test set (for final model assessment). Predictive models can evolve over time as developers improve their performance, either by acquiring new data or by improving the existing model. The main contribution of this paper is to discuss the problems encountered in such dynamic model building and updating scenarios, and to propose workflows for allocating newly acquired data to the different sets. Specifically, we propose three workflows (parallel dump, serial waterfall, and hybrid) for allocating new data into the existing training, validation, and test splits. Particular emphasis is placed on avoiding the bias caused by repeated use of the existing validation or test set.
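To make the allocation problem concrete, here is a minimal Python sketch of one possible reading of the "parallel dump" workflow. The abstract does not define the workflows in detail, so everything here is an assumption for illustration: the function name `parallel_dump`, the 60/20/20 fractions, and the idea that new data is shuffled, divided into three parts, and appended to the existing splits at the same time. It is not the authors' implementation, and the serial waterfall and hybrid workflows are not shown.

```python
import random


def parallel_dump(splits, new_data, fractions=(0.6, 0.2, 0.2), seed=0):
    """Hypothetical helper: append shuffled new data to all three splits.

    Assumption (not from the abstract): new data is divided by fixed
    fractions and each part is added to the matching existing split.
    """
    rng = random.Random(seed)
    shuffled = list(new_data)
    rng.shuffle(shuffled)

    n = len(shuffled)
    n_train = round(fractions[0] * n)
    n_val = round(fractions[1] * n)

    # The test part takes whatever remains, so no item is dropped.
    parts = {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }

    # Build new lists so the caller's existing splits are left unmodified.
    return {name: list(splits[name]) + parts[name] for name in parts}


# Toy data: integer IDs stand in for labelled examples.
splits = {
    "train": list(range(0, 60)),
    "validation": list(range(60, 80)),
    "test": list(range(80, 100)),
}
updated = parallel_dump(splits, range(100, 150))
print({name: len(items) for name, items in updated.items()})
```

Under this reading the existing validation and test items stay in place, so a sketch like this does nothing by itself to remove the bias from their repeated use, which is the concern the abstract emphasizes.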
Pages: 3-19
Number of pages: 17
Related papers
50 in total
  • [21] Angel investors' predictive and control funding criteria: The importance of evolving business models
    Crick, James M.
    Crick, Dave
    JOURNAL OF RESEARCH IN MARKETING AND ENTREPRENEURSHIP, 2018, 20 (01) : 34 - 56
  • [22] Scalable, Updatable Predictive Models for Sequence Data
    Koul, Neeraj
    Bui, Ngot
    Honavar, Vasant
    2010 IEEE INTERNATIONAL CONFERENCE ON BIOINFORMATICS AND BIOMEDICINE, 2010, : 681 - 685
  • [23] Using predictive models to improve data quality of character data
    Ak, M
    Grossman, D
    Frieder, O
    McCabe, MC
    ISE'2001: PROCEEDINGS OF THE INTERNATIONAL SYMPOSIUM ON INFORMATION SYSTEMS AND ENGINEERING, 2001, : 229 - 234
  • [24] Data Augmentation for Compositional Data: Advancing Predictive Models of the Microbiome
    Gordon-Rodriguez, Elliott
    Quinn, Thomas P.
    Cunningham, John P.
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35, NEURIPS 2022, 2022
  • [25] Reality Check: The Use of Big Data and Predictive Data Models
    Alvero, Kevin M.
    ISACA Journal, 2023, 1 : 16 - 19
  • [26] Data based predictive models for odor perception
    Chacko, Rinu
    Jain, Deepak
    Patwardhan, Manasi
    Puri, Abhishek
    Karande, Shirish
    Rai, Beena
    SCIENTIFIC REPORTS, 2020, 10 (01)
  • [27] Reconstructing historical habitat data with predictive models
    Zweig, Christa L.
    Kitchens, Wiley M.
    ECOLOGICAL APPLICATIONS, 2014, 24 (01) : 196 - 203
  • [28] Influential data points in predictive logistic models
    deMoraes, AR
    Dunsmore, IR
    STATISTICS AND COMPUTING, 1996, 6 (03) : 263 - 268
  • [29] Data based predictive models for odor perception
    Rinu Chacko
    Deepak Jain
    Manasi Patwardhan
    Abhishek Puri
    Shirish Karande
    Beena Rai
    Scientific Reports, 10
  • [30] PubChem BioAssays as a data source for predictive models
    Chen, Bin
    Wild, David J.
    JOURNAL OF MOLECULAR GRAPHICS & MODELLING, 2010, 28 (05) : 420 - 426