Making inference with messy (citizen science) data: when are data accurate enough and how can they be improved?

Cited by: 38
Authors
Clare, John D. J. [1 ]
Townsend, Philip A. [1 ]
Anhalt-Depies, Christine [1 ]
Locke, Christina [2 ]
Stenglein, Jennifer L. [2 ]
Frett, Susan [2 ]
Martin, Karl J. [3 ]
Singh, Aditya [1 ,4 ]
Van Deelen, Timothy R. [1 ]
Zuckerberg, Benjamin [1 ]
Affiliations
[1] Univ Wisconsin, Dept Forest & Wildlife Ecol, 1630 Linden Dr, Madison, WI 53706 USA
[2] Wisconsin Dept Nat Resources, Off Appl Sci, Madison, WI 53716 USA
[3] Univ Wisconsin, Div Cooperat Extens, Madison, WI 53706 USA
[4] Univ Florida, Dept Agr & Biol Engn, Gainesville, FL 32611 USA
Keywords
automated classification; citizen science; crowdsourcing; false-positive error; misclassification; remote camera; species distribution model; DATA QUALITY; CAMERA TRAPS; ERROR; BIODIVERSITY; CHALLENGES; VOLUNTEER; MODELS; STATE; TOOL;
DOI
10.1002/eap.1849
Chinese Library Classification (CLC)
Q14 [Ecology (Bioecology)]
Discipline codes
071012; 0713
Abstract
Measurement or observation error is common in ecological data: as citizen scientists and automated algorithms play larger roles in processing growing volumes of data to address problems at large scales, concerns about data quality and strategies for improving it have received greater focus. However, practical guidance on the fundamental data quality questions facing data users or managers (how accurate do data need to be, and what is the best or most efficient way to improve them?) remains limited. We present a generalizable framework for evaluating data quality and identifying remediation practices, and demonstrate the framework using trail camera images classified via crowdsourcing to determine acceptable rates of misclassification and identify optimal remediation strategies for analysis using occupancy models. We used expert validation to estimate baseline classification accuracy and simulation to determine the sensitivity of two occupancy estimators (standard and false-positive extensions) to different empirical misclassification rates. We used regression techniques to identify important predictors of misclassification and prioritize remediation strategies. More than 93% of images were accurately classified, but simulation results suggested that most species were not identified accurately enough to permit distribution estimation at our predefined threshold for accuracy (<5% absolute bias). A model developed to screen incorrect classifications predicted misclassified images with >97% accuracy: enough to meet our accuracy threshold. Occupancy models that accounted for false-positive error provided still more accurate inference, even at high rates of misclassification (30%). As simulation suggested that occupancy models were less sensitive to additional false-negative error, screening models or fitting occupancy models that account for false-positive error emerged as efficient data remediation solutions.
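The screening idea above can be sketched in miniature. The rule below flags a crowdsourced image for expert review when volunteer agreement on the plurality label is low; the agreement-based rule and the 0.8 threshold are illustrative stand-ins, not the regression-based screening model the paper actually fits.

```python
from collections import Counter

def screen_by_agreement(votes, threshold=0.8):
    """Return (plurality label, needs_expert_review).

    `votes` is a list of volunteer classifications for one image.
    The image is flagged for expert review when the fraction of
    volunteers agreeing with the plurality label is below `threshold`.
    """
    counts = Counter(votes)
    label, top = counts.most_common(1)[0]
    return label, top / len(votes) < threshold

# Strong consensus is accepted; a split vote is routed to experts.
print(screen_by_agreement(["deer"] * 9 + ["elk"]))      # -> ('deer', False)
print(screen_by_agreement(["deer"] * 6 + ["elk"] * 4))  # -> ('deer', True)
```

Routing only low-agreement images to experts is one way such a screen concentrates limited validation effort on the records most likely to be misclassified.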
Combining simulation-based sensitivity analysis with empirical estimation of baseline error and its variability allows users and managers of potentially error-prone data to identify and fix problematic data more efficiently. It may be particularly helpful for "big data" efforts dependent upon citizen scientists or automated classification algorithms with many downstream users, but given the ubiquity of observation or measurement error, even conventional studies may benefit from focusing more attention upon data quality.
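The simulation-based sensitivity analysis described above can be illustrated with a minimal sketch: generate detection histories with a known false-positive rate and observe how a naive occupancy estimate (the fraction of sites with at least one detection) drifts away from truth. All parameter values (`psi`, `p`, `fp`) are illustrative assumptions, and the naive estimator is a stand-in for the occupancy models used in the paper.

```python
import random

random.seed(42)

def simulate_detections(n_sites=1000, n_visits=5, psi=0.5, p=0.5, fp=0.05):
    """Simulate visit-level detection data with false-positive error.

    psi: true occupancy probability; p: detection probability at
    occupied sites; fp: false-positive probability at unoccupied sites.
    """
    histories = []
    for _ in range(n_sites):
        occupied = random.random() < psi
        histories.append([
            random.random() < (p if occupied else fp)
            for _ in range(n_visits)
        ])
    return histories

def naive_occupancy(histories):
    """Fraction of sites with at least one (possibly false) detection."""
    return sum(any(h) for h in histories) / len(histories)

# False positives inflate the naive estimate above the true psi,
# while imperfect detection alone biases it low.
print(naive_occupancy(simulate_detections(fp=0.0)))
print(naive_occupancy(simulate_detections(fp=0.05)))
```

Repeating this over a grid of misclassification rates and comparing estimator bias against a predefined accuracy threshold mirrors the kind of sensitivity analysis the framework combines with empirical estimates of baseline error.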
Pages: 15