Can Machine Learning Pipelines Be Better Configured?

Cited by: 0

Authors:

Wang, Yibo [1]

Wang, Ying [1,2]

Zhang, Tingwei [1]

Yu, Yue [3]

Cheung, Shing-Chi [4]

Yu, Hai [1]

Zhu, Zhiliang [5,6]

Affiliations:

[1] Northeastern Univ, Shenyang, Peoples R China

[2] HKUST, Shenyang, Hong Kong, Peoples R China

[3] Natl Univ Def Technol, Changsha, Peoples R China

[4] Hong Kong Univ Sci & Technol, Hong Kong, Peoples R China

[5] Northeastern Univ, Natl Frontiers Sci Ctr Ind Intelligence & Syst Op, Shenyang, Peoples R China

[6] Northeastern Univ, Key Lab Data Analyt & Optimizat Smart Ind, Shenyang, Peoples R China

Source:

PROCEEDINGS OF THE 31ST ACM JOINT MEETING EUROPEAN SOFTWARE ENGINEERING CONFERENCE AND SYMPOSIUM ON THE FOUNDATIONS OF SOFTWARE ENGINEERING, ESEC/FSE 2023 | 2023

Funding:

National Natural Science Foundation of China;

Keywords:

Machine Learning Libraries; Empirical Study; SIZE;

DOI:

10.1145/3611643.3616352

Chinese Library Classification (CLC) number:

TP31 [Computer Software];

Discipline classification code:

081202 ; 0835 ;

Abstract:

A Machine Learning (ML) pipeline configures the workflow of a learning task using the APIs provided by ML libraries. However, a pipeline's performance can vary significantly across different configurations of ML library versions. Misconfigured pipelines can result in inferior performance, such as inefficient executions, numeric errors and even crashes. A pipeline is subject to misconfiguration if it exhibits significantly inconsistent performance upon changes in the versions of its configured libraries or the combination of these libraries. We refer to such performance inconsistency as a pipeline configuration (PLC) issue. A systematic understanding of PLC issues helps configure effective ML pipelines and identify misconfigured ones. To this end, we conduct the first empirical study of PLC issues' pervasiveness, impact and root causes. To facilitate scalable in-depth analysis, we develop Piecer, an infrastructure that automatically generates a set of pipeline variants by varying different version combinations of ML libraries and detects their performance inconsistencies. We apply Piecer to the 3,380 pipelines that can be deployed out of the 11,363 ML pipelines collected from multiple ML competitions on the Kaggle platform. The empirical study results show that 1,092 (32.3%) of the 3,380 pipelines manifest significant performance inconsistencies on at least one variant. We find that 399, 243 and 440 pipelines can achieve better competition scores, execution time and memory usage, respectively, by adopting a different configuration. Based on our findings, we construct a repository containing 164 defective APIs and 106 API combinations from 418 library versions. The defective API repository facilitates future studies of automated detection techniques for PLC issues. Leveraging the repository, we capture PLC issues in 309 real-world ML pipelines.
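The variant-generation idea in the abstract can be illustrated with a minimal Python sketch. This is not Piecer's implementation: the function names, the 20% relative threshold, the library versions, and the timing values are all hypothetical placeholders chosen for illustration, and the abstract does not state how "significant" inconsistency is measured.

```python
from itertools import product


def generate_variants(candidate_versions):
    """Yield one {library: version} mapping per version combination."""
    libraries = sorted(candidate_versions)
    for combo in product(*(candidate_versions[lib] for lib in libraries)):
        yield dict(zip(libraries, combo))


def find_inconsistent_variants(baseline, measurements, rel_threshold=0.2):
    """Return variants whose metric deviates from the baseline by more than
    rel_threshold (relative difference). The threshold is an assumption."""
    flagged = []
    for variant, value in measurements:
        if baseline == 0:
            deviates = value != 0
        else:
            deviates = abs(value - baseline) / abs(baseline) > rel_threshold
        if deviates:
            flagged.append(variant)
    return flagged


# Hypothetical candidate versions for two libraries (illustrative only).
candidate_versions = {
    "numpy": ["1.19.5", "1.21.6"],
    "scikit-learn": ["0.24.2", "1.0.2"],
}
variants = list(generate_variants(candidate_versions))

# Placeholder execution times in seconds, NOT results from the paper.
# In practice each variant would be installed and the pipeline re-run.
measured_seconds = [10.0, 10.5, 14.0, 9.8]
flagged = find_inconsistent_variants(
    baseline=measured_seconds[0],
    measurements=zip(variants, measured_seconds),
)
print(flagged)
```

The same comparison could be applied to other metrics named in the abstract, such as competition score or memory usage, by swapping in the corresponding measurements.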

Pages: 463-475

Page count: 13
