A Data-Driven Analysis of Behaviors in Data Curation Processes

Cited by: 0
Authors
Han, Lei [1]
Chen, Tianwa [1]
Demartini, Gianluca [1]
Indulska, Marta [1]
Sadiq, Shazia [1]
Affiliations
[1] Univ Queensland, Brisbane, Qld, Australia
Keywords
Interaction behavior; search pattern; data curation; SOFTWARE
DOI
10.1145/3567419
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Understanding how data workers interact with data, and with the various pieces of information related to data preparation, is key to designing systems that can better support them in exploring datasets. To date, however, there is a paucity of research studying the strategies adopted by data workers as they carry out data preparation activities. In this work, we investigate a specific data preparation activity, namely data quality discovery, and aim to (i) understand the behaviors of data workers in discovering data quality issues, (ii) explore what factors (e.g., prior experience) can affect their behaviors, and (iii) understand how these behavioral observations relate to their performance. To this end, we collect a multi-modal dataset through a data-driven experiment that relies on eye-tracking technology and a purpose-designed platform built on top of IPython Notebook. The experiment results reveal that: (i) 'copy-paste-modify' is a typical strategy for writing code to complete tasks; (ii) proficiency in writing code has a significant impact on the quality of task performance, while perceived difficulty and efficacy can influence task completion patterns; and (iii) searching in external resources is a prevalent action that can be leveraged to achieve better performance. Furthermore, our experiment indicates that providing sample code within the system can help data workers get started with their task, and that surfacing underlying data is an effective way to support exploration. By investigating data worker behaviors prior to each search action, we also find that the most common reasons that trigger external search actions are the need to seek assistance in writing or debugging code and the need to find relevant code to reuse. Based on our experiment results, we showcase a systematic approach to selecting the best code snippets created by data workers and assembling them to achieve better performance than the best individual performer in the dataset. In doing so, our findings not only provide insights into patterns of interaction with various system components and information resources when performing data curation tasks, but also help build effective and efficient data curation processes through data workers' collective intelligence.
Pages: 35