Data Management for Machine Learning: A Survey

Cited by: 15
Authors
Chai, Chengliang [1]
Wang, Jiayi [1]
Luo, Yuyu [1]
Niu, Zeping [1]
Li, Guoliang [2]
Affiliations
[1] Tsinghua Univ, Dept Comp Sci, Beijing 100190, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci, Tsinghua Natl Lab Informat Sci & Technol TNList, Beijing 100190, Peoples R China
Keywords
Data models; Training; Computational modeling; Cleaning; Training data; Optimization; Task analysis; Database; machine learning; data preparation; model training; model inference; VISUALIZATION; AUGMENTATION; OPTIMIZATION; SYSTEM; MODEL
DOI
10.1109/TKDE.2022.3148237
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Machine learning (ML) has widespread applications and has revolutionized many industries, but it faces several challenges. First, sufficient high-quality training data is essential for producing a well-performing model, but such data is often expensive to acquire in terms of human effort. Second, large amounts of training data and complicated model structures make training and inference inefficient. Third, for a given ML task, one often needs to train many models, which are hard to manage in real applications. Fortunately, database techniques can benefit ML by addressing these three challenges. In this paper, we review existing studies from three aspects that follow the ML pipeline. (1) Data preparation (Pre-ML): this stage focuses on preparing high-quality training data that can improve the performance of the ML model; here we review data discovery, data cleaning, and data labeling. (2) Model training and inference (In-ML): researchers in the ML community focus on improving model performance during training, whereas in this survey we mainly study how to accelerate the entire training process, including feature selection and model selection. (3) Model management (Post-ML): in this part, we survey how to store, query, deploy, and debug models after training. Finally, we discuss research challenges and future directions.
Pages: 4646-4667
Number of pages: 22