Can Machine Learning Pipelines Be Better Configured?

Cited by: 0
Authors
Wang, Yibo [1]
Wang, Ying [1, 2]
Zhang, Tingwei [1]
Yu, Yue [3]
Cheung, Shing-Chi [4]
Yu, Hai [1]
Zhu, Zhiliang [5, 6]
Affiliations
[1] Northeastern Univ, Shenyang, Peoples R China
[2] HKUST, Shenyang, Hong Kong, Peoples R China
[3] Natl Univ Def Technol, Changsha, Peoples R China
[4] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China
[5] Northeastern Univ, Natl Frontiers Sci Ctr Ind Intelligence & Syst Op, Shenyang, Peoples R China
[6] Northeastern Univ, Key Lab Data Analyt & Optimizat Smart Ind, Shenyang, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Machine Learning Libraries; Empirical Study; SIZE;
DOI
10.1145/3611643.3616352
Chinese Library Classification number
TP31 [Computer Software]
Discipline classification code
081202; 0835
Abstract
A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline's performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as inefficient executions, numeric errors and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. A systematic understanding of PLC issues helps configure effective ML pipelines and identify misconfigured ones. To this end, we conduct the first empirical study of PLC issues' pervasiveness, impact and root causes. To facilitate scalable in-depth analysis, we develop Piecer, an infrastructure that automatically generates a set of pipeline variants by varying different version combinations of ML libraries and detects their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the 3,380 pipelines manifest significant performance inconsistencies on at least one variant. We find that 399, 243 and 440 pipelines can achieve better competition scores, execution time and memory usage, respectively, by adopting a different configuration. Based on our findings, we construct a repository containing 164 defective APIs and 106 API combinations from 418 library versions. The defective API repository facilitates future studies of automated detection techniques for PLC issues. Leveraging the repository, we captured PLC issues in 309 real-world ML pipelines.
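The variant-generation idea described in the abstract can be illustrated with a minimal sketch. This is not Piecer's actual implementation, which the abstract does not detail. The library names (`lib_a`, `lib_b`), the version strings, the function names, the median-based comparison and the 20% threshold are all assumptions made for illustration only.

```python
from itertools import product
from statistics import median


def enumerate_variants(candidate_versions):
    """Yield one {library: version} mapping per version combination.

    candidate_versions maps a (hypothetical) library name to the list
    of versions to try for it.
    """
    libs = sorted(candidate_versions)
    for combo in product(*(candidate_versions[lib] for lib in libs)):
        yield dict(zip(libs, combo))


def flag_inconsistent(measurements, rel_tol=0.2):
    """Return the variants whose metric deviates from the median by
    more than rel_tol (relative difference).

    measurements is a list of (variant, metric_value) pairs, e.g. the
    execution time observed when running a pipeline under that variant.
    The median baseline and the threshold are illustrative assumptions.
    """
    baseline = median(value for _, value in measurements)
    if baseline == 0:
        return []
    return [
        variant
        for variant, value in measurements
        if abs(value - baseline) / abs(baseline) > rel_tol
    ]
```

In use, each variant returned by `enumerate_variants` would be installed into an isolated environment, the pipeline would be run there, and the observed score, execution time or memory usage would be passed to `flag_inconsistent`. The installation and execution steps are omitted here because they depend on the environment.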
Pages: 463-475
Number of pages: 13
Related papers
50 in total
  • [21] Making Better Use of the Crowd: How Crowdsourcing Can Advance Machine Learning Research
    Vaughan, Jennifer Wortman
    JOURNAL OF MACHINE LEARNING RESEARCH, 2018, 18
  • [22] Optimizing Data Pipelines for Machine Learning in Feature Stores
    Liu, Rui
    Park, Kwanghyun
    Psallidas, Fotis
    Zhu, Xiaoyong
    Mo, Jinghui
    Sen, Rathijit
    Interlandi, Matteo
    Karanasos, Konstantinos
    Tian, Yuanyuan
    Camacho-Rodriguez, Jesus
    PROCEEDINGS OF THE VLDB ENDOWMENT, 2023, 16 (13): : 4230 - 4239
  • [23] MLINSPECT: A Data Distribution Debugger for Machine Learning Pipelines
    Grafberger, Stefan
    Guha, Shubha
    Stoyanovich, Julia
    Schelter, Sebastian
    SIGMOD '21: PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MANAGEMENT OF DATA, 2021, : 2736 - 2739
  • [24] A Prestudy of Machine Learning in Industrial Quality Control Pipelines
    Ravnican, Joze
    Marinko, Anze
    Noveski, Gjorgji
    Kalabakov, Stefan
    Jovanovi, Marko
    Gazvoda, Samo
    Gams, Matjaz
    INFORMATICA-AN INTERNATIONAL JOURNAL OF COMPUTING AND INFORMATICS, 2022, 46 (02): : 187 - 196
  • [25] Towards Accelerating Generic Machine Learning Prediction Pipelines
    Scolari, Alberto
    Lee, Yunseong
    Weimer, Markus
    Interlandi, Matteo
    2017 IEEE 35TH INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 2017, : 431 - 434
  • [26] Applying Machine Learning to the Fuel Theft Problem on Pipelines
    Ventriglia, Rachel Martins
    Dantas, Leila Figueiredo
    Brandao, Bianca
    Hamacher, Silvio
    Rocha, Marcos Vinicius Belle
    David, Andre Silveira
    Ribeiro, Frederico Chalita
    JOURNAL OF PIPELINE SYSTEMS ENGINEERING AND PRACTICE, 2023, 14 (02)
  • [27] Review on automated condition assessment of pipelines with machine learning
    Liu, Yiming
    Bao, Yi
    ADVANCED ENGINEERING INFORMATICS, 2022, 53
  • [28] Disdat: Bundle Data Management for Machine Learning Pipelines
    Yocum, Ken
    Rowan, Sean
    Lunt, Jonathan
    Wong, Theodore M.
    PROCEEDINGS OF THE 2019 USENIX CONFERENCE ON OPERATIONAL MACHINE LEARNING, 2019, : 35 - 37
  • [29] Comparison of Machine Learning Pipelines for Gene Expression Matrices
    Devino, Mateus
    Belloze, Kele
    Bezerra, Eduardo
    ADVANCES IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, BSB 2022, 2022, 13523 : 32 - 37
  • [30] Machine Learning approach to corrosion assessment in subsea pipelines
    De Masi, Giulia
    Gentile, Manuela
    Vichi, Roberta
    Bruschi, Roberto
    Gabetta, Giovanna
    OCEANS 2015 - GENOVA, 2015,