Disdat: Bundle Data Management for Machine Learning Pipelines

Cited by: 0

Authors:

Yocum, Ken [1]

Rowan, Sean [1]

Lunt, Jonathan [1]

Wong, Theodore M. [2]

Affiliations:

[1] Intuit Inc, Mountain View, CA 94043 USA

[2] 23andMe Inc, Mountain View, CA USA

Source:

PROCEEDINGS OF THE 2019 USENIX CONFERENCE ON OPERATIONAL MACHINE LEARNING | 2019

Keywords:

DOI:

Not available

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Modern machine learning pipelines can produce hundreds of data artifacts (such as features, models, and predictions) throughout their lifecycle. During that time, data scientists need to reproduce errors, update features, re-train on specific data, validate / inspect outputs, and share models and predictions. Doing so requires the ability to publish, discover, and version those artifacts. This work introduces Disdat, a system to simplify ML pipelines by addressing these data management challenges. Disdat is built on two core data abstractions: bundles and contexts. A bundle is a versioned, typed, immutable collection of data. A context is a sharable set of bundles that can exist on local and cloud storage environments. Disdat provides a bundle management API that we use to extend an existing workflow system to produce and consume bundles. This bundle-based approach to data management has simplified both authoring and deployment of our ML pipelines.
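The two abstractions named in the abstract can be illustrated with a minimal sketch. This is not Disdat's API: the `Bundle` and `Context` classes, their fields, and the `publish`, `latest`, and `get` methods are all hypothetical names chosen for illustration. The sketch models only immutability and versioning, and it leaves out typing, sharing, and cloud storage.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Dict, List, Mapping
import uuid


@dataclass(frozen=True)
class Bundle:
    """Hypothetical immutable collection of named data items (not Disdat's class)."""
    name: str
    data: Mapping[str, object]
    bundle_id: str = field(default_factory=lambda: uuid.uuid4().hex)

    def __post_init__(self):
        # Copy the caller's mapping and expose it through a read-only view,
        # so the bundle's contents cannot change after creation.
        object.__setattr__(self, "data", MappingProxyType(dict(self.data)))


class Context:
    """Hypothetical named set of bundles; keeps every published version."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._versions: Dict[str, List[Bundle]] = {}

    def publish(self, bundle: Bundle) -> int:
        # Publishing never overwrites: it appends a new version
        # and returns its 1-based version number.
        versions = self._versions.setdefault(bundle.name, [])
        versions.append(bundle)
        return len(versions)

    def latest(self, name: str) -> Bundle:
        return self._versions[name][-1]

    def get(self, name: str, version: int) -> Bundle:
        return self._versions[name][version - 1]


ctx = Context("dev")
v1 = ctx.publish(Bundle("features", {"rows": 10}))
v2 = ctx.publish(Bundle("features", {"rows": 12}))
print(v1, v2, ctx.get("features", 1).data["rows"], ctx.latest("features").data["rows"])
```

Because earlier versions stay retrievable, a pipeline step in this sketch could re-run against `ctx.get("features", 1)` to reproduce an old result while newer steps read `ctx.latest("features")`.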

Pages: 35-37

Page count: 3

