Machine learning for automated content analysis: characteristics of training data impact reliability

Cited by: 5
Authors
Fussell, Rebeckah [1]
Mazrui, Ali [1]
Holmes, N. G. [1]
Affiliations
[1] Cornell Univ, Lab Atom & Solid State Phys, Ithaca, NY 14853 USA
DOI
10.1119/perc.2022.pr.Fussell
Chinese Library Classification (CLC) number
G40 [Education]
Subject classification code
040101; 120403
Abstract
Natural language processing (NLP) has the capacity to increase the scale and efficiency of content analysis in Physics Education Research. One promise of this approach is the possibility of implementing coding schemes on large data sets taken from diverse contexts. Applying NLP has two main challenges, however. First, a large initial human-coded data set is needed for training, though it is not immediately clear how much training data are needed. Second, if new data are taken from a different context from the training data, automated coding may be impacted in unpredictable ways. In this study, we investigate the conditions necessary to address these two challenges for a survey question that probes students' perspectives on the reliability of physics experimental results. We use neural networks in conjunction with Bag of Words embedding to perform automated coding of student responses for two binary codes, meaning each code is either present or absent in a response. We find that i) substantial agreement is consistently achieved for our data when the training set exceeds 600 responses, with 80-100 responses containing each code and ii) it is possible to perform automated coding using training data from a disparate context, but variation in code frequencies (outcome balances) across specific contexts can affect the reliability of coding. We offer suggestions for best practices in automated coding. Other smaller-scale investigations across a diverse range of coding scheme types and data contexts are needed to develop generalized principles.
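The two ingredients the abstract names, a Bag of Words representation and an agreement check between human and automated binary codes, can be illustrated with a minimal sketch. It assumes that agreement is measured with Cohen's kappa, under the common convention that values of roughly 0.61 to 0.80 count as "substantial"; the abstract does not state which statistic the authors used. The function names, vocabulary, and example labels are hypothetical, and the neural network classifier itself is not shown.

```python
from collections import Counter


def bag_of_words(text, vocabulary):
    """Return word counts for `text`, in the order given by `vocabulary`."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cohen_kappa(human, machine):
    """Cohen's kappa for two equal-length lists of binary codes (0 or 1)."""
    n = len(human)
    # Fraction of responses on which the two coders agree.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Agreement expected by chance, from each coder's rate of assigning the code.
    p_h, p_m = sum(human) / n, sum(machine) / n
    expected = p_h * p_m + (1 - p_h) * (1 - p_m)
    if expected == 1.0:
        # Degenerate case: kappa is undefined; 1.0 is returned here by convention.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical example: one code, ten responses (1 = code present, 0 = absent).
human_codes = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
machine_codes = [1, 1, 0, 0, 0, 0, 1, 0, 0, 1]
features = bag_of_words("The data are reliable data", ["data", "reliable", "error"])
kappa = cohen_kappa(human_codes, machine_codes)
```

Because kappa corrects for chance agreement, it depends on how often each code occurs, which is one reason differing code frequencies between training and new contexts can change the measured reliability.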
Pages: 194-199
Number of pages: 6