An Intermediate Representation for Optimizing Machine Learning Pipelines

Cited by: 22
Authors
Kunft, Andreas [1]
Katsifodimos, Asterios [2]
Schelter, Sebastian [3]
Bress, Sebastian [1,4]
Rabl, Tilmann [5]
Markl, Volker [1,4]
Affiliations
[1] TU Berlin, Berlin, Germany
[2] Delft Univ Technol, Delft, Netherlands
[3] NYU, New York, NY 10003 USA
[4] DFKI, Kaiserslautern, Germany
[5] Univ Potsdam, HPI, Potsdam, Germany
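Source
PROCEEDINGS OF THE VLDB ENDOWMENT | 2019 / Vol. 12 / No. 11
Keywords
SCALABLE LINEAR ALGEBRA; SYSTEMS; PLANS
DOI
10.14778/3342263.3342633
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract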
Machine learning (ML) pipelines for model training and validation typically include preprocessing, such as data cleaning and feature engineering, prior to training an ML model. Preprocessing combines relational algebra and user-defined functions (UDFs), while model training uses iterations and linear algebra. Current systems are tailored to either of the two. As a consequence, preprocessing and ML steps are optimized in isolation. To enable holistic optimization of ML training pipelines, we present Lara, a declarative domain-specific language for collections and matrices. Lara's intermediate representation (IR) reflects on the complete program, i.e., UDFs, control flow, and both data types. Two views on the IR enable diverse optimizations. Monads enable operator pushdown and fusion across type and loop boundaries. Combinators provide the semantics of domain-specific operators and optimize data access and cross-validation of ML algorithms. Our experiments on preprocessing pipelines and selected ML algorithms show the effects of our proposed optimizations on dense and sparse data, which achieve speedups of up to an order of magnitude.
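The operator pushdown and fusion across type boundaries mentioned in the abstract can be illustrated with a minimal sketch. This is not Lara's API or its IR: plain Python lists stand in for both the collection and the matrix, and the function names `unfused` and `fused`, the feature UDF, and the sample rows are all hypothetical. The sketch only shows the general idea that a selection applied after the collection-to-matrix conversion can be moved before it and merged with the feature-building map into a single pass.

```python
# Illustrative sketch only: not Lara's API. All names here are hypothetical.

def unfused(rows):
    """Build the full feature matrix first, then select rows on the matrix side."""
    # UDF-based feature engineering over a collection of (label, value) records
    features = [[1.0, float(v), float(v) ** 2] for _, v in rows]
    labels = [lbl for lbl, _ in rows]
    # Selection is applied only after every row has been converted
    kept = [(f, l) for f, l in zip(features, labels) if f[1] >= 0.0]
    return [f for f, _ in kept], [l for _, l in kept]


def fused(rows):
    """Same result: the filter is pushed below the conversion and fused with the map."""
    features, labels = [], []
    for lbl, v in rows:  # single pass over the input collection
        if v >= 0:  # filter evaluated on the collection side
            x = float(v)
            features.append([1.0, x, x ** 2])  # feature UDF fused into the same loop
            labels.append(lbl)
    return features, labels


rows = [(1, 2), (0, -3), (1, 0), (0, 5)]
assert unfused(rows) == fused(rows)
```

The fused variant never materializes feature rows that the filter would discard. A system that optimizes preprocessing and training in isolation cannot apply this kind of rewrite, because the filter and the map sit on opposite sides of the type boundary.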
Pages: 1553 - 1567
Page count: 15