Supporting Better Insights of Data Science Pipelines with Fine-grained Provenance

Cited by: 1
Authors
Chapman, Adriane [1]
Lauro, Luca [2]
Missier, Paolo [3]
Torlone, Riccardo [2]
Affiliations
[1] Univ Southampton, Sch Elect & Comp Sci, Highfield Campus, Southampton SO17 1SX, Hants, England
[2] Univ Roma Tre, Dipartimento Ingn, Via Vasca Navale 79, I-00146 Rome, Italy
[3] Newcastle Univ, Sch Comp, Urban Sci Bldg,1 Sci Sq, Newcastle Upon Tyne NE4 5TG, Tyne & Wear, England
Source
ACM TRANSACTIONS ON DATABASE SYSTEMS | 2024 / Vol. 49 / Issue 02
Keywords
Provenance; data science; data preparation; preprocessing; MODEL;
DOI
10.1145/3644385
Chinese Library Classification number
TP [Automation technology, computer technology];
Discipline classification code
0812 ;
Abstract
Successful data-driven science requires complex data engineering pipelines to clean, transform, and alter data in preparation for machine learning, and robust results can only be achieved when each step in the pipeline can be justified and its effect on the data explained. In this framework, we aim to provide data scientists with facilities to gain an in-depth understanding of how each step in the pipeline affects the data, from the raw input to training sets ready to be used for learning. Starting from an extensible set of data preparation operators commonly used within a data science setting, we present a provenance management infrastructure for generating, storing, and querying very granular accounts of data transformations, at the level of individual elements within datasets whenever possible. Then, from the formal definition of a core set of data science preprocessing operators, we derive a provenance semantics embodied by a collection of templates expressed in PROV, a standard model for data provenance. Using those templates as a reference, our provenance generation algorithm generalises to any operator with observable input/output pairs. We provide a prototype implementation of an application-level provenance capture library that produces, in a semiautomatic way, complete provenance documents that account for the entire pipeline. We report on the ability of that reference implementation to capture provenance in real ML benchmark pipelines and over TPC-DI synthetic data. Finally, we show how the collected provenance can be used to answer a suite of provenance benchmark queries that underpin some common pipeline inspection questions, as expressed on the Data Science Stack Exchange.
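
To make the idea of element-level provenance from observable input/output pairs concrete, the following is a minimal illustrative sketch, not the authors' algorithm or library. It assumes datasets are plain dictionaries keyed by row id, compares a dataset before and after an operator runs, and emits records named after standard PROV relations (`wasGeneratedBy`, `wasDerivedFrom`, `wasInvalidatedBy`). The names `ProvRecord`, `cell_id`, `capture`, and `impute_age` are hypothetical and introduced only for this example.

```python
from dataclasses import dataclass

# Assumption: a dataset is {row_id: {column: value}}; real pipelines would use dataframes.

@dataclass(frozen=True)
class ProvRecord:
    relation: str  # name of a PROV relation
    entity: str    # id of the data element the record is about
    source: str    # id of the input element, or name of the operator (activity)

def cell_id(version, row_id, column):
    """Build a hypothetical identifier for one cell in one dataset version."""
    return f"{version}/{row_id}/{column}"

def capture(operator_name, before, after, v_in, v_out):
    """Derive cell-level provenance by comparing an operator's input and output."""
    records = []
    # Cells that are new or whose value changed were generated by the operator.
    for row_id, row in after.items():
        for col, value in row.items():
            out_id = cell_id(v_out, row_id, col)
            old_row = before.get(row_id)
            if old_row is None or col not in old_row:
                records.append(ProvRecord("wasGeneratedBy", out_id, operator_name))
            elif old_row[col] != value:
                records.append(ProvRecord("wasGeneratedBy", out_id, operator_name))
                records.append(
                    ProvRecord("wasDerivedFrom", out_id, cell_id(v_in, row_id, col))
                )
    # Cells present in the input but absent from the output were invalidated.
    for row_id, row in before.items():
        for col in row:
            if row_id not in after or col not in after[row_id]:
                records.append(
                    ProvRecord("wasInvalidatedBy", cell_id(v_in, row_id, col), operator_name)
                )
    return records

def impute_age(dataset):
    """Hypothetical operator: fill missing 'age' values with the mean of known ages."""
    known = [r["age"] for r in dataset.values() if r["age"] is not None]
    mean_age = sum(known) / len(known)
    return {
        rid: {**r, "age": mean_age if r["age"] is None else r["age"]}
        for rid, r in dataset.items()
    }

before = {
    0: {"age": 31, "city": "Rome"},
    1: {"age": None, "city": "Newcastle"},
}
after = impute_age(before)
records = capture("impute_age", before, after, "d0", "d1")
for rec in records:
    print(rec)
```

Because the sketch only inspects inputs and outputs, it treats the operator as a black box. It therefore cannot tell that the imputed value also depended on the other rows' ages; capturing dependencies of that kind would need operator-specific knowledge, which is the role the abstract assigns to the PROV templates.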
Pages: 42