A Model-Driven Approach for Systematic Reproducibility and Replicability of Data Science Projects

Cited by: 3
Authors
Melchor, Fran [1]
Rodriguez-Echeverria, Roberto [1]
Conejero, Jose M. [1]
Prieto, Alvaro E. [1]
Gutierrez, Juan D. [1]
Affiliations
[1] Univ Extremadura, INTIA, Caceres, Spain
Keywords
Reproducibility; Replicability; Process; Data science; Model-driven engineering; PROVENANCE; PIPELINES
DOI
10.1007/978-3-031-07472-1_9
Chinese Library Classification number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
In the last few years, the number of tools and approaches for defining data science pipelines has increased considerably. These tools support not only pipeline definition but also generation of the code needed to execute the project, which makes projects easier to carry out even for non-expert users. However, they still leave some challenges unaddressed, e.g. executing pipelines defined with different tools or executing them in different environments (reproducibility and replicability), and validating and verifying models by identifying inconsistent operations (intentionality). To alleviate these problems, this paper presents a model-driven framework for defining data science pipelines independently of the particular execution platform and tools. The framework separates the pipeline definition into two modelling layers: a conceptual layer, where the data scientist specifies all the data and model operations to be carried out by the pipeline; and an operational layer, where the data engineer describes the details of the execution environment in which the operations defined in the conceptual layer will be implemented. Based on this abstract definition and layer separation, the approach allows: the use of different tools, thus improving process replicability; the automation of process execution, enhancing process reproducibility; and the definition of model verification rules, providing intentionality restrictions.
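The layer separation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' framework: every name in it (`Operation`, `ConceptualPipeline`, `OperationalBinding`, `run`, and the artefact labels) is hypothetical, and the verification rule shown (consecutive operations must agree on the artefact they exchange) is only an assumed example of an "inconsistent operation" check.

```python
import statistics
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Operation:
    """Conceptual layer: what is done, with no reference to any tool."""
    name: str
    consumes: str  # label of the artefact the operation expects
    produces: str  # label of the artefact the operation yields


@dataclass
class ConceptualPipeline:
    operations: List[Operation] = field(default_factory=list)

    def verify(self) -> List[str]:
        """Assumed verification rule: adjacent operations must be compatible."""
        problems = []
        for prev, nxt in zip(self.operations, self.operations[1:]):
            if prev.produces != nxt.consumes:
                problems.append(
                    f"{nxt.name} expects {nxt.consumes} "
                    f"but {prev.name} produces {prev.produces}"
                )
        return problems


@dataclass
class OperationalBinding:
    """Operational layer: how each operation is implemented in one environment."""
    platform: str
    implementations: Dict[str, Callable]


def run(pipeline: ConceptualPipeline, binding: OperationalBinding, data):
    """Verify the conceptual model, then execute it with the given binding."""
    problems = pipeline.verify()
    if problems:
        raise ValueError("; ".join(problems))
    for op in pipeline.operations:
        data = binding.implementations[op.name](data)
    return data


# One conceptual pipeline, written once by the data scientist.
pipeline = ConceptualPipeline([
    Operation("drop_missing", "raw_values", "clean_values"),
    Operation("average", "clean_values", "number"),
])

# Two interchangeable operational bindings for the same conceptual pipeline.
pure_python = OperationalBinding("pure-python", {
    "drop_missing": lambda xs: [x for x in xs if x is not None],
    "average": lambda xs: sum(xs) / len(xs),
})
stdlib_stats = OperationalBinding("statistics-module", {
    "drop_missing": lambda xs: [x for x in xs if x is not None],
    "average": statistics.fmean,
})

result_a = run(pipeline, pure_python, [1.0, None, 3.0])
result_b = run(pipeline, stdlib_stats, [1.0, None, 3.0])
```

In this sketch, swapping `pure_python` for `stdlib_stats` leaves the conceptual pipeline untouched, which mirrors the replicability claim; `run` automates execution, which mirrors the reproducibility claim; and `verify` stands in for the model verification rules.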
Pages: 147 - 163
Number of pages: 17
Related papers
50 in total
  • [1] Model-Driven Approach for Making Citizen Science Data FAIR
    Luna, Reynaldo Alvarez
    Garrigos, Irene
    Zubcoff, Jose
    Gonzalez-Mora, Cesar
    INTERNATIONAL JOURNAL OF SOFTWARE ENGINEERING AND KNOWLEDGE ENGINEERING, 2024, 34 (06) : 891 - 907
  • [2] A better approach for dealing with reproducibility and replicability in science
    Nichols, James D.
    Oli, Madan K.
    Kendall, William L.
    Boomer, G. Scott
    PROCEEDINGS OF THE NATIONAL ACADEMY OF SCIENCES OF THE UNITED STATES OF AMERICA, 2021, 118 (07)
  • [3] A Model-Driven Approach for Biomedical Data Integration
    Carlson, David
    Farkash, Ariel
    Timm, John T. E.
    MEDINFO 2010, PTS I AND II, 2010, 160 : 1164 - 1168
  • [4] Model-driven Architecture Approach for Data Warehouse
    Fernandes, Lucia Abrunhosa
    Helena Neto, Beatriz
    Fagundes, Vladimir
    Zimbrao, Geraldo
    de Souza, Jano Moreira
    Salvador, Rodrigo
    SIXTH INTERNATIONAL CONFERENCE ON AUTONOMIC AND AUTONOMOUS SYSTEMS: ICAS 2010, PROCEEDINGS, 2010 : 156 - 161
  • [5] A Model-driven Approach to Data Structure Conceptualization
    Ristic, Sonja
    Kordic, Slavica
    Celikovic, Milan
    Dimitrieski, Vladimir
    Lukovic, Ivan
    PROCEEDINGS OF THE 2015 FEDERATED CONFERENCE ON COMPUTER SCIENCE AND INFORMATION SYSTEMS, 2015, 5 : 977 - 984
  • [6] A Model-Driven Approach to Enterprise Data Migration
    Yeddula, Raghavendra Reddy
    Das, Prasenjit
    Reddy, Sreedhar
    ADVANCED INFORMATION SYSTEMS ENGINEERING, CAISE 2015, 2015, 9097 : 230 - 243
  • [7] A data- and model-driven approach for cancer treatment
    Sophia Schade
    Lesley A. Ogilvie
    Thomas Kessler
    Moritz Schütte
    Christoph Wierling
    Bodo M. Lange
    Hans Lehrach
    Marie-Laure Yaspo
    Der Onkologe, 2019, 25 : 132 - 137
  • [8] A data- and model-driven approach for cancer treatment
    Schade, Sophia
    Ogilvie, Lesley A.
    Kessler, Thomas
    Schuette, Moritz
    Wierling, Christoph
    Lange, Bodo M.
    Lehrach, Hans
    Yaspo, Marie-Laure
    ONKOLOGE, 2019, 25 (Suppl 2) : 132 - 137
  • [9] DataSpecer: A Model-Driven Approach to Managing Data Specifications
    Stenchlak, Stepan
    Necasky, Martin
    Skoda, Petr
    Klimek, Jakub
    SEMANTIC WEB: ESWC 2022 SATELLITE EVENTS, 2022, 13384 : 52 - 56
  • [10] Explainability in Graph Data Science: Interpretability, replicability, and reproducibility of community detection
    Aviyente, Selin
    Karaaslanli, Abdullah
    IEEE SIGNAL PROCESSING MAGAZINE, 2022, 39 (04) : 25 - 39