Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance

Cited by: 1
Authors
Chapman, Adriane [1]
Lauro, Luca [2]
Missier, Paolo [3]
Torlone, Riccardo [2]
Affiliations
[1] Univ Southampton, Sch Elect & Comp Sci, Highfield Campus, Southampton SO17 1SX, Hants, England
[2] Univ Roma Tre, Dipartimento Ingn, Via Vasca Navale 79, I-00146 Rome, Italy
[3] Newcastle Univ, Sch Comp, Urban Sci Bldg,1 Sci Sq, Newcastle Upon Tyne NE4 5TG, Tyne & Wear, England
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2024 / Vol. 49 / Issue 02
Keywords
Provenance; data science; data preparation; preprocessing; MODEL;
DOI
10.1145/3644385
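Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology];
Discipline classification code
0812 ;
Abstract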
Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified and its effect on the data explained. In this framework, we aim to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, in this work we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library that produces, in a semiautomatic way, complete provenance documents that account for the entire pipeline. We report on the ability of that reference implementation to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. Finally, we show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.
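As an illustration only, the idea of deriving element-level provenance from an operator's observable input/output pair might be sketched as follows. This is not the authors' library or algorithm: the table representation, the `element_id` naming scheme, and the `capture_provenance` function are all hypothetical, and only the relation names (`used`, `wasGeneratedBy`, `wasDerivedFrom`, `wasInvalidatedBy`) are taken from the PROV model.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical representations, chosen for this sketch only.
Table = Dict[int, Dict[str, Any]]  # row id -> {column name -> value}
Record = Tuple[str, str, str]      # (PROV relation, subject, object)


def element_id(version: int, row: int, col: str) -> str:
    """Name one cell of one dataset version (hypothetical scheme)."""
    return f"v{version}/r{row}/{col}"


def capture_provenance(op_name: str, before: Table, after: Table,
                       version: int) -> List[Record]:
    """Compare an operator's input and output tables cell by cell and
    emit PROV-style relations for every element that changed."""
    activity = f"{op_name}#{version}"
    records: List[Record] = []
    for row, cells in before.items():
        for col, old in cells.items():
            src = element_id(version, row, col)
            if row not in after or col not in after[row]:
                # The element no longer exists after the operator ran.
                records.append(("wasInvalidatedBy", src, activity))
            elif after[row][col] != old:
                # The element was rewritten: link new value to old value.
                dst = element_id(version + 1, row, col)
                records.append(("used", activity, src))
                records.append(("wasGeneratedBy", dst, activity))
                records.append(("wasDerivedFrom", dst, src))
    for row, cells in after.items():
        for col in cells:
            if row not in before or col not in before[row]:
                # A brand-new element created by the operator.
                dst = element_id(version + 1, row, col)
                records.append(("wasGeneratedBy", dst, activity))
    return records


# Usage: a made-up imputation step fills one missing value.
before = {0: {"age": 30, "city": "Rome"}, 1: {"age": None, "city": "Leeds"}}
after = {0: {"age": 30, "city": "Rome"}, 1: {"age": 30, "city": "Leeds"}}
records = capture_provenance("impute_mean", before, after, version=0)
for relation, subject, obj in records:
    print(f"{relation}({subject}, {obj})")
```

Because the sketch inspects only the data before and after a step, it needs no knowledge of the operator's internals. The cost is precision: a cell-by-cell diff cannot tell which other elements contributed to a new value, which is the kind of information operator-specific templates would be expected to supply.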
Pages: 42
Related papers
50 in total
  • [1] Capturing and Querying Fine-grained Provenance of Preprocessing Pipelines in Data Science
    Chapman, Adriane
    Missier, Paolo
    Simonelli, Giulia
    Torlone, Riccardo
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2020, 14 (04): : 507 - 520
  • [2] GeneaLog: Fine-Grained Data Streaming Provenance at the Edge
    Palyvos-Giannas, Dimitris
    Gulisano, Vincenzo
    Papatriantafilou, Marina
    [J]. MIDDLEWARE'18: PROCEEDINGS OF THE 2018 ACM/IFIP/USENIX MIDDLEWARE CONFERENCE, 2018, : 227 - 238
  • [3] Fine-Grained, Secure and Efficient Data Provenance on Blockchain Systems
    Ruan, Pingcheng
    Chen, Gang
    Tien Tuan Anh Dinh
    Lin, Qian
    Ooi, Beng Chin
    Zhang, Meihui
    [J]. PROCEEDINGS OF THE VLDB ENDOWMENT, 2019, 12 (09): : 975 - 988
  • [4] Fine-Grained Provenance for Matching & ETL
    Zheng, Nan
    Alawini, Abdussalam
    Ives, Zachary G.
    [J]. 2019 IEEE 35TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING (ICDE 2019), 2019, : 184 - 195
  • [5] LineageChain: a fine-grained, secure and efficient data provenance system for blockchains
    Pingcheng Ruan
    Tien Tuan Anh Dinh
    Qian Lin
    Meihui Zhang
    Gang Chen
    Beng Chin Ooi
    [J]. The VLDB Journal, 2021, 30 : 3 - 24
  • [6] LineageChain: a fine-grained, secure and efficient data provenance system for blockchains
    Ruan, Pingcheng
    Tien Tuan Anh Dinh
    Lin, Qian
    Zhang, Meihui
    Chen, Gang
    Ooi, Beng Chin
    [J]. VLDB JOURNAL, 2021, 30 (01): : 3 - 24
  • [7] Supporting fine-grained data lineage in a database visualization environment
    Woodruff, A
    Stonebraker, M
    [J]. 13TH INTERNATIONAL CONFERENCE ON DATA ENGINEERING - PROCEEDINGS, 1997, : 91 - 102
  • [8] A hybrid memory architecture supporting fine-grained data migration
    Ye Chi
    Jianhui Yue
    Xiaofei Liao
    Haikun Liu
    Hai Jin
    [J]. Frontiers of Computer Science, 2024, 18
  • [9] A hybrid memory architecture supporting fine-grained data migration
    Chi, Ye
    Yue, Jianhui
    Liao, Xiaofei
    Liu, Haikun
    Jin, Hai
    [J]. FRONTIERS OF COMPUTER SCIENCE, 2024, 18 (02)
  • [10] A Distributed System for The Management of Fine-grained Provenance
    Sultana, Salmin
    Bertino, Elisa
    [J]. JOURNAL OF DATABASE MANAGEMENT, 2015, 26 (02) : 32 - 47