An Intermediate Representation for Optimizing Machine Learning Pipelines

Cited by: 22

Authors:

Kunft, Andreas ^{[1]}

Katsifodimos, Asterios ^{[2]}

Schelter, Sebastian ^{[3]}

Bress, Sebastian ^{[1,4]}

Rabl, Tilmann ^{[5]}

Markl, Volker ^{[1,4]}

Affiliations:

[1] TU Berlin, Berlin, Germany

[2] Delft Univ Technol, Delft, Netherlands

[3] NYU, New York, NY 10003 USA

[4] DFKI, Kaiserslautern, Germany

[5] Univ Potsdam, HPI, Potsdam, Germany

Source:

PROCEEDINGS OF THE VLDB ENDOWMENT | 2019 / Vol. 12 / No. 11

Keywords:

SCALABLE LINEAR ALGEBRA; SYSTEMS; PLANS;

DOI:

10.14778/3342263.3342633

Chinese Library Classification (CLC) number:

TP [Automation Technology, Computer Technology];

Subject classification code:

0812;

Abstract:

Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.


Pages: 1553 - 1567

Page count: 15

