A Data-Driven Analysis of Behaviors in Data Curation Processes

Cited by: 0
Authors
Han, Lei [1]
Chen, Tianwa [1]
Demartini, Gianluca [1]
Indulska, Marta [1]
Sadiq, Shazia [1]
Affiliation
[1] Univ Queensland, Brisbane, Qld, Australia
Keywords
Interaction behavior; search pattern; data curation; SOFTWARE;
DOI
10.1145/3567419
Chinese Library Classification (CLC) number
TP [Automation technology, computer technology];
Discipline classification code
0812;
Abstract
Understanding how data workers interact with data and with the various pieces of information related to data preparation is key to designing systems that can better support them in exploring datasets. To date, however, there is a paucity of research studying the strategies adopted by data workers as they carry out data preparation activities. In this work, we investigate a specific data preparation activity, namely data quality discovery, and aim to (i) understand the behaviors of data workers in discovering data quality issues, (ii) explore what factors (e.g., prior experience) can affect their behaviors, and (iii) understand how these behavioral observations relate to their performance. To this end, we collect a multi-modal dataset through a data-driven experiment that relies on eye-tracking technology and a purpose-designed platform built on top of IPython Notebook. The experiment results reveal that: (i) 'copy-paste-modify' is a typical strategy for writing code to complete tasks; (ii) proficiency in writing code has a significant impact on the quality of task performance, while perceived difficulty and efficacy can influence task completion patterns; and (iii) searching in external resources is a prevalent action that can be leveraged to achieve better performance. Furthermore, our experiment indicates that providing sample code within the system can help data workers get started with their task, and that surfacing underlying data is an effective way to support exploration. By investigating data worker behaviors prior to each search action, we also find that the most common reasons that trigger external search actions are the need to seek assistance in writing or debugging code and the need to search for relevant code to reuse. Based on our experiment results, we showcase a systematic approach to select from the best code snippets created by data workers and assemble them to achieve better performance than the best individual performer in the dataset. By doing so, our findings not only provide insights into patterns of interaction with various system components and information resources when performing data curation tasks, but also help build effective and efficient data curation processes through data workers' collective intelligence.
Pages: 35