Can Machine Learning Pipelines Be Better Configured?

Cited by: 0
Authors
Wang, Yibo [1]
Wang, Ying [1, 2]
Zhang, Tingwei [1]
Yu, Yue [3]
Cheung, Shing-Chi [4]
Yu, Hai [1]
Zhu, Zhiliang [5, 6]
Affiliations
[1] Northeastern Univ, Shenyang, Peoples R China
[2] HKUST, Shenyang, Hong Kong, Peoples R China
[3] Natl Univ Def Technol, Changsha, Peoples R China
[4] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[5] Northeastern Univ, Natl Frontiers Sci Ctr Ind Intelligence & Syst Op, Shenyang, Peoples R China
[6] Northeastern Univ, Key Lab Data Analyt & Optimizat Smart Ind, Shenyang, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Machine Learning Libraries; Empirical Study; SIZE;
DOI
10.1145/3611643.3616352
Chinese Library Classification (CLC) number
TP31 [Computer Software];
Discipline classification code
081202 ; 0835 ;
Abstract
A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline's performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as inefficient executions, numeric errors and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. A systematic understanding of PLC issues helps configure effective ML pipelines and identify misconfigured ones. To this end, we conduct the first empirical study of the pervasiveness, impact and root causes of PLC issues. To facilitate scalable in-depth analysis, we develop Piecer, an infrastructure that automatically generates a set of pipeline variants by varying the version combinations of ML libraries and detects their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the 3,380 pipelines manifest significant performance inconsistencies on at least one variant. We find that 399, 243 and 440 pipelines can achieve better competition scores, execution time and memory usage, respectively, by adopting a different configuration. Based on our findings, we construct a repository containing 164 defective APIs and 106 API combinations from 418 library versions. The defective API repository facilitates future studies of automated detection techniques for PLC issues. Leveraging the repository, we captured PLC issues in 309 real-world ML pipelines.
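The variant-generation idea in the abstract can be illustrated with a minimal sketch. It is not Piecer's implementation: the function names, the library/version lists, the 20% relative threshold and the timing values are all hypothetical placeholders chosen for illustration, and the abstract does not state how "significant" inconsistency is defined.

```python
from itertools import product


def enumerate_variants(candidate_versions):
    """Yield one {library: version} mapping per version combination."""
    libraries = sorted(candidate_versions)
    for combo in product(*(candidate_versions[lib] for lib in libraries)):
        yield dict(zip(libraries, combo))


def find_inconsistent(baseline, measurements, rel_threshold=0.2):
    """Return indices of variants whose metric deviates from the baseline
    by more than rel_threshold (an assumed cutoff, not from the paper)."""
    return [
        i
        for i, value in enumerate(measurements)
        if abs(value - baseline) > rel_threshold * abs(baseline)
    ]


# Hypothetical candidate versions for two libraries.
candidate_versions = {
    "numpy": ["1.19.5", "1.21.6"],
    "scikit-learn": ["0.24.2", "1.0.2"],
}
variants = list(enumerate_variants(candidate_versions))

# Placeholder execution times in seconds, one per variant (not real data).
execution_times = [10.0, 10.5, 14.0, 9.8]
flagged = find_inconsistent(baseline=10.0, measurements=execution_times)
for i in flagged:
    print("inconsistent variant:", variants[i])
```

In practice each variant would be installed in an isolated environment and the pipeline re-run to collect its score, execution time and memory usage; the sketch only covers enumerating the combinations and comparing one metric against a baseline.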
Pages: 463-475
Number of pages: 13