Can Machine Learning Pipelines Be Better Configured?

Cited by: 0

Authors:

Wang, Yibo [1]

Wang, Ying [1,2]

Zhang, Tingwei [1]

Yu, Yue [3]

Cheung, Shing-Chi [4]

Yu, Hai [1]

Zhu, Zhiliang [5,6]

Affiliations:

[1] Northeastern Univ, Shenyang, Peoples R China

[2] HKUST, Shenyang, Hong Kong, Peoples R China

[3] Natl Univ Def Technol, Changsha, Peoples R China

[4] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

[5] Northeastern Univ, Natl Frontiers Sci Ctr Ind Intelligence & Syst Op, Shenyang, Peoples R China

[6] Northeastern Univ, Key Lab Data Analyt & Optimizat Smart Ind, Shenyang, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Machine Learning Libraries; Empirical Study; SIZE;

DOI:

10.1145/3611643.3616352

Chinese Library Classification number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline's performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as inefficient executions, numeric errors and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. A systematic understanding of PLC issues helps configure effective ML pipelines and identify misconfigured ones. To this end, we conduct the first empirical study of PLC issues' pervasiveness, impact and root causes. To facilitate scalable in-depth analysis, we develop Piecer, an infrastructure that automatically generates a set of pipeline variants by varying the version combinations of ML libraries and detects their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the 3,380 pipelines manifest significant performance inconsistencies on at least one variant. We find that 399, 243 and 440 pipelines can achieve better competition scores, execution time and memory usage, respectively, by adopting a different configuration. Based on our findings, we construct a repository containing 164 defective APIs and 106 API combinations from 418 library versions. The defective API repository facilitates future studies of automated detection techniques for PLC issues. Leveraging the repository, we captured PLC issues in 309 real-world ML pipelines.
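
The variant-generation idea described in the abstract can be illustrated with a minimal Python sketch. It is not Piecer's implementation: the library names `lib_a` and `lib_b`, the version strings, the function names, the timing values and the 20% relative-deviation threshold are all hypothetical, and the abstract does not state how Piecer decides that an inconsistency is significant.

```python
from itertools import product
from statistics import median


def generate_variants(candidate_versions):
    """Enumerate every combination of library versions.

    candidate_versions maps a library name to a list of version strings.
    Each returned variant maps every library name to one chosen version.
    """
    libs = sorted(candidate_versions)
    version_lists = [candidate_versions[lib] for lib in libs]
    return [dict(zip(libs, combo)) for combo in product(*version_lists)]


def find_inconsistent(measurements, threshold=0.2):
    """Flag variants whose metric deviates strongly from the median.

    measurements is a list of (variant, value) pairs, for example the
    execution time of one pipeline under each variant. The relative
    threshold is an assumption made for this sketch only.
    """
    base = median(value for _, value in measurements)
    if base == 0:
        # Relative deviation is undefined for a zero baseline.
        return []
    return [
        variant
        for variant, value in measurements
        if abs(value - base) / abs(base) > threshold
    ]


# Hypothetical libraries and versions, used only for illustration.
variants = generate_variants({
    "lib_a": ["1.0", "2.0"],
    "lib_b": ["0.9", "1.1"],
})

# Made-up execution times in seconds, one per variant.
times = [10.0, 10.5, 10.2, 19.0]
flagged = find_inconsistent(list(zip(variants, times)))
print(flagged)
```

With these made-up timings, only the last variant deviates from the median by more than the assumed threshold, so it is the only one flagged. A real study would run the pipeline under each variant and would need a more careful significance criterion than a fixed relative cutoff.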


Pages: 463-475

Number of pages: 13
