Data Lifecycle Challenges in Production Machine Learning: A Survey

Cited by: 94
Authors
Polyzotis, Neoklis [1]
Roy, Sudip [1]
Whang, Steven Euijong [1]
Zinkevich, Martin [2, 3]
Affiliations
[1] Google Res, Mountain View, CA USA
[2] Korea Adv Inst Sci & Technol, Daejeon, South Korea
[3] Google, Mountain View, CA USA
Keywords
ANALYTICS; SELECTION; MODEL
DOI
10.1145/3299887.3299891
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Machine learning has become an essential tool for gleaning knowledge from data and tackling a diverse set of computationally hard tasks. However, the accuracy of a machine-learned model is deeply tied to the data that it is trained on. Designing and building robust processes and tools that make it easier to analyze, validate, and transform data that is fed into large-scale machine learning systems poses data management challenges. Drawing on our experience in developing data-centric infrastructure for a production machine learning platform at Google, we summarize some of the interesting research challenges that we encountered, and survey some of the relevant literature from the data management and machine learning communities. Specifically, we explore challenges in three main areas of focus: data understanding, data validation and cleaning, and data preparation. In each of these areas, we try to explore how different constraints are imposed on the solutions depending on where in the lifecycle of a model the problems are encountered and who encounters them.
Pages: 17-28
Page count: 12