A Data-Driven Analysis of Behaviors in Data Curation Processes

Cited by: 0
Authors
Han, Lei [1]
Chen, Tianwa [1]
Demartini, Gianluca [1]
Indulska, Marta [1]
Sadiq, Shazia [1]
Affiliations
[1] Univ Queensland, Brisbane, Qld, Australia
Keywords
Interaction behavior; search pattern; data curation; software
DOI
10.1145/3567419
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology]
Discipline classification code
0812
Abstract
Understanding how data workers interact with data, and with the various pieces of information related to data preparation, is key to designing systems that can better support them in exploring datasets. To date, however, there is a paucity of research studying the strategies adopted by data workers as they carry out data preparation activities. In this work, we investigate a specific data preparation activity, namely data quality discovery, and aim to (i) understand the behaviors of data workers in discovering data quality issues, (ii) explore what factors (e.g., prior experience) can affect their behaviors, and (iii) understand how these behavioral observations relate to their performance. To this end, we collect a multi-modal dataset through a data-driven experiment that relies on eye-tracking technology and a purpose-designed platform built on top of IPython Notebook. The experiment results reveal that: (i) 'copy-paste-modify' is a typical strategy for writing code to complete tasks; (ii) proficiency in writing code has a significant impact on the quality of task performance, while perceived difficulty and efficacy can influence task completion patterns; and (iii) searching in external resources is a prevalent action that can be leveraged to achieve better performance. Furthermore, our experiment indicates that providing sample code within the system can help data workers get started with their task, and that surfacing underlying data is an effective way to support exploration. By investigating data worker behaviors prior to each search action, we also find that the most common triggers for external search actions are the need to seek assistance in writing or debugging code and the need to find relevant code to reuse. Based on our experiment results, we showcase a systematic approach that selects the best code snippets created by data workers and assembles them to achieve better performance than the best individual performer in the dataset. By doing so, our findings not only provide insights into patterns of interaction with various system components and information resources when performing data curation tasks, but also help build effective and efficient data curation processes through data workers' collective intelligence.
Pages: 35