Optimizing Data Collection for Machine Learning

Cited by: 0
Authors
Mahmood, Rafid [1]
Lucas, James [1]
Alvarez, Jose M. [1]
Fidler, Sanja [1, 2, 3]
Law, Marc T. [1]
Affiliations
[1] NVIDIA, Santa Clara, CA 95051 USA
[2] Univ Toronto, Toronto, ON, Canada
[3] Vector Inst, Toronto, ON, Canada
Keywords
POWER;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104; 0812; 0835; 1405;
Abstract
Modern deep learning systems require huge data sets to achieve impressive performance, but there is little guidance on how much or what kind of data to collect. Over-collecting data incurs unnecessary present costs, while under-collecting may incur future costs and delay workflows. We propose a new paradigm for modeling the data collection workflow as a formal optimal data collection problem that allows designers to specify performance targets, collection costs, a time horizon, and penalties for failing to meet the targets. Additionally, this formulation generalizes to tasks requiring multiple data sources, such as labeled and unlabeled data used in semi-supervised learning. To solve our problem, we develop Learn-Optimize-Collect (LOC), which minimizes expected future collection costs. Finally, we numerically compare our framework to the conventional baseline of estimating data requirements by extrapolating from neural scaling laws. We significantly reduce the risks of failing to meet desired performance targets on several classification, segmentation, and detection tasks, while maintaining low total collection costs.
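The conventional baseline named in the abstract, estimating data requirements by extrapolating from a neural scaling law, can be illustrated with a minimal sketch. This is not the authors' LOC method or their code. It assumes the error follows a simple power law, error ≈ a · n^(−b), and the function names, pilot sizes, and error values are all hypothetical and synthetic.

```python
import numpy as np


def fit_power_law(sizes, errors):
    """Fit error ≈ a * n**(-b) by least squares in log-log space."""
    slope, intercept = np.polyfit(np.log(sizes), np.log(errors), 1)
    return float(np.exp(intercept)), float(-slope)


def required_size(a, b, target_error):
    """Invert error = a * n**(-b) for the smallest n that meets the target."""
    return int(np.ceil((a / target_error) ** (1.0 / b)))


# Hypothetical pilot measurements, generated synthetically from 2.0 * n**-0.5
# (illustrative values only, not results from the paper).
sizes = np.array([1000.0, 2000.0, 4000.0, 8000.0])
errors = 2.0 * sizes ** -0.5

a, b = fit_power_law(sizes, errors)
n_needed = required_size(a, b, target_error=0.01)
print(f"a={a:.3f}, b={b:.3f}, estimated samples needed: {n_needed}")
```

A point estimate like this ignores uncertainty in the fitted curve, so it may under- or over-shoot the true requirement. The abstract reports that LOC reduces the risk of missing performance targets relative to this kind of extrapolation.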
Pages: 14