Optimizing Data Collection for Machine Learning

Cited by: 0
Authors
Mahmood, Rafid [1]
Lucas, James [1]
Alvarez, Jose M. [1]
Fidler, Sanja [1, 2, 3]
Law, Marc T. [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Toronto, Toronto, ON, Canada
[3] Vector Inst, Toronto, ON, Canada
Keywords
POWER;
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
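The conventional baseline mentioned in the abstract, estimating data requirements by extrapolating from a neural scaling law, can be illustrated with a minimal sketch. This is not the LOC method and not the authors' code. It assumes a plain power law of the form error = a * n^(-b) with no offset term, fitted by least squares in log-log space. The function names `fit_power_law` and `required_size` and the pilot measurements are hypothetical and chosen only for illustration.

```python
import math


def fit_power_law(sizes, errors):
    """Fit error = a * n**(-b) by ordinary least squares on log-log data.

    Assumption: the scaling law is a pure power law without an offset.
    """
    xs = [math.log(n) for n in sizes]
    ys = [math.log(e) for e in errors]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    a = math.exp(mean_y - slope * mean_x)
    b = -slope
    return a, b


def required_size(a, b, target_error):
    """Invert the fitted law to get the data set size that reaches target_error."""
    return math.ceil((a / target_error) ** (1.0 / b))


# Hypothetical pilot measurements: validation error at four data set sizes.
pilot_sizes = [1000, 2000, 4000, 8000]
pilot_errors = [2.0 * n ** -0.5 for n in pilot_sizes]  # synthetic, for illustration

a, b = fit_power_law(pilot_sizes, pilot_errors)
needed = required_size(a, b, target_error=0.01)
print(f"fitted a={a:.3f}, b={b:.3f}, estimated size needed: {needed}")
```

A point estimate like this ignores the uncertainty of the fit, so a small error in the fitted exponent can translate into a large under-estimate of the data needed. That is the risk of missing the target that the abstract says the proposed framework reduces.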
Pages: 14