Improving Consensus Scoring of Crowdsourced Data Using the Rasch Model: Development and Refinement of a Diagnostic Instrument

Cited by: 10
Authors
Brady, Christopher John [1]
Mudie, Lucy Iluka [1]
Wang, Xueyang [1]
Guallar, Eliseo [2]
Friedman, David Steven [1,2]
Affiliations
[1] Johns Hopkins Univ, Sch Med, Wilmer Eye Inst, Dana Ctr Prevent Ophthalmol, 600 N Wolfe St, Baltimore, MD 21205 USA
[2] Johns Hopkins Univ, Dept Epidemiol, Bloomberg Sch Publ Hlth, Baltimore, MD USA
Funding
US National Institutes of Health
Keywords
crowdsourcing; diabetic retinopathy; Rasch analysis; Amazon Mechanical Turk; DIABETIC-RETINOPATHY; TELEMEDICINE; RISK; MELLITUS; IMAGES;
DOI
10.2196/jmir.7984
Chinese Library Classification (CLC) number
R19 [Health care organizations and services (health services administration)]
Abstract
Background: Diabetic retinopathy (DR) is a leading cause of vision loss in working-age individuals worldwide. While screening is effective and cost-effective, it remains underutilized, and novel methods are needed to increase detection of DR. This clinical validation study compared diagnostic gradings of retinal fundus photographs provided by volunteers on the Amazon Mechanical Turk (AMT) crowdsourcing marketplace with expert-provided gold-standard grading and explored whether determination of the consensus of crowdsourced classifications could be improved beyond a simple majority vote (MV) using regression methods.

Objective: The aim of our study was to determine whether regression methods could be used to improve the consensus grading of data collected by crowdsourcing.

Methods: A total of 1200 retinal images of individuals with diabetes mellitus from the Messidor public dataset were posted to AMT. Eligible crowdsourcing workers had at least 500 previously approved tasks with an approval rating of 99% across their prior submitted work. A total of 10 workers were recruited to classify each image as normal or abnormal. If half or more of the workers judged the image to be abnormal, the MV consensus grade was recorded as abnormal. Rasch analysis was then used to calculate worker ability scores in a random 50% training set, and these scores were used as weights in a regression model in the remaining 50% test set to determine whether a more accurate consensus could be devised. Outcomes of interest were the percentage of correctly classified images, sensitivity, specificity, and area under the receiver operating characteristic curve (AUROC) for the consensus grade as compared with the expert grading provided with the dataset.

Results: Using MV grading, the consensus was correct in 75.5% of images (906/1200), with 75.5% sensitivity, 75.5% specificity, and an AUROC of 0.75 (95% CI 0.73-0.78). A logistic regression model using Rasch-weighted individual scores generated an AUROC of 0.91 (95% CI 0.88-0.93) compared with 0.89 (95% CI 0.86-0.92) for a model using unweighted scores (chi-square P<.001). With the diagnostic cut-point set to achieve 90% sensitivity, 77.5% of images (465/600) were graded correctly, with 90.3% sensitivity, 68.5% specificity, and an AUROC of 0.79 (95% CI 0.76-0.83).

Conclusions: Crowdsourced interpretations of retinal images provide rapid and accurate results as compared with a gold-standard grading. Creating a logistic regression model using Rasch analysis to weight crowdsourced classifications by worker ability improves accuracy of aggregated grades as compared with simple majority vote.
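The majority-vote rule and the sensitivity and specificity outcomes described in the Methods can be illustrated with a minimal Python sketch. This is not the authors' code: the function names and the 0/1 encoding (1 = abnormal, 0 = normal) are assumptions made for illustration, and the Rasch-weighted logistic regression step is not reproduced here.

```python
from typing import Sequence, Tuple


def majority_vote(votes: Sequence[int]) -> int:
    """Consensus grade for one image: 1 (abnormal) if half or more votes are 1.

    Assumed encoding: 1 = abnormal, 0 = normal. A tie counts as abnormal,
    matching the "half or more" rule stated in the abstract.
    """
    return 1 if 2 * sum(votes) >= len(votes) else 0


def sensitivity_specificity(pred: Sequence[int], truth: Sequence[int]) -> Tuple[float, float]:
    """Sensitivity and specificity of consensus grades against expert grades.

    Assumes both classes occur at least once in `truth`.
    """
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    return tp / (tp + fn), tn / (tn + fp)


# Toy data (invented for illustration): 3 images, 10 worker votes each.
votes_per_image = [
    [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # 5 of 10 abnormal: tie, graded abnormal
    [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # 4 of 10 abnormal: graded normal
    [1, 1, 1, 1, 1, 1, 1, 1, 0, 0],  # 8 of 10 abnormal: graded abnormal
]
consensus = [majority_vote(v) for v in votes_per_image]
```

In this toy example the consensus grades can then be passed to `sensitivity_specificity` together with the expert grades for the same images.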
Pages: 13