Disdat: Bundle Data Management for Machine Learning Pipelines

Authors
Yocum, Ken [1]
Rowan, Sean [1]
Lunt, Jonathan [1]
Wong, Theodore M. [2]
Affiliations
[1] Intuit Inc, Mountain View, CA 94043 USA
[2] 23andMe Inc, Mountain View, CA USA
Source
Proceedings of the 2019 USENIX Conference on Operational Machine Learning | 2019
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Modern machine learning pipelines can produce hundreds of data artifacts (such as features, models, and predictions) throughout their lifecycle. During that time, data scientists need to reproduce errors, update features, re-train on specific data, validate / inspect outputs, and share models and predictions. Doing so requires the ability to publish, discover, and version those artifacts. This work introduces Disdat, a system to simplify ML pipelines by addressing these data management challenges. Disdat is built on two core data abstractions: bundles and contexts. A bundle is a versioned, typed, immutable collection of data. A context is a sharable set of bundles that can exist on local and cloud storage environments. Disdat provides a bundle management API that we use to extend an existing workflow system to produce and consume bundles. This bundle-based approach to data management has simplified both authoring and deployment of our ML pipelines.
Pages: 35-37
Page count: 3