Machine learning for automated content analysis: characteristics of training data impact reliability

Times cited: 5
Authors
Fussell, Rebeckah [1]
Mazrui, Ali [1]
Holmes, N. G. [1]
Affiliations
[1] Cornell Univ, Lab Atom & Solid State Phys, Ithaca, NY 14853 USA
Keywords
DOI
10.1119/perc.2022.pr.Fussell
Chinese Library Classification (CLC) number
G40 [Education];
Discipline classification code
040101; 120403;
Abstract
Natural language processing (NLP) has the capacity to increase the scale and efficiency of content analysis in Physics Education Research. One promise of this approach is the possibility of implementing coding schemes on large data sets taken from diverse contexts. Applying NLP has two main challenges, however. First, a large initial human-coded data set is needed for training, though it is not immediately clear how much training data are needed. Second, if new data are taken from a different context from the training data, automated coding may be impacted in unpredictable ways. In this study, we investigate the conditions necessary to address these two challenges for a survey question that probes students' perspectives on the reliability of physics experimental results. We use neural networks in conjunction with Bag of Words embedding to perform automated coding of student responses for two binary codes, meaning each code is either present or absent in a response. We find that i) substantial agreement is consistently achieved for our data when the training set exceeds 600 responses, with 80-100 responses containing each code and ii) it is possible to perform automated coding using training data from a disparate context, but variation in code frequencies (outcome balances) across specific contexts can affect the reliability of coding. We offer suggestions for best practices in automated coding. Other smaller-scale investigations across a diverse range of coding scheme types and data contexts are needed to develop generalized principles.
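As a rough illustration of two ingredients named in the abstract, the sketch below builds a Bag of Words encoding of free-text responses and scores agreement between human and automated binary codes. It makes several assumptions that are not stated in the abstract: agreement is measured here with Cohen's kappa (the phrase "substantial agreement" is conventionally tied to kappa values of roughly 0.61 to 0.80), tokenization is a plain lowercase whitespace split, and the function names `bag_of_words` and `cohens_kappa` are hypothetical. The neural network itself is omitted, so this is not the authors' pipeline.

```python
from collections import Counter


def bag_of_words(responses):
    """Return (vocabulary, count vectors) for a list of text responses.

    Assumption: tokens are lowercase, whitespace-separated words.
    """
    tokenized = [r.lower().split() for r in responses]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    index = {tok: i for i, tok in enumerate(vocab)}
    vectors = []
    for toks in tokenized:
        vec = [0] * len(vocab)
        for tok, count in Counter(toks).items():
            vec[index[tok]] = count
        vectors.append(vec)
    return vocab, vectors


def cohens_kappa(human, machine):
    """Cohen's kappa for two equal-length lists of binary codes (0 or 1)."""
    if len(human) != len(machine) or not human:
        raise ValueError("code lists must be non-empty and of equal length")
    n = len(human)
    # Observed agreement: fraction of responses given the same code.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Chance agreement from each coder's rate of marking the code present.
    p_human = sum(human) / n
    p_machine = sum(machine) / n
    expected = p_human * p_machine + (1 - p_human) * (1 - p_machine)
    if expected == 1.0:
        # Both coders used one identical label throughout; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Made-up responses and codes, for illustration only.
    vocab, vectors = bag_of_words(["the data agree", "the data data"])
    print(vocab, vectors)
    print(cohens_kappa([1, 1, 0, 0, 1, 0, 0, 0], [1, 0, 0, 0, 1, 0, 0, 1]))
```

Because kappa corrects for chance agreement using each coder's code frequency, it shifts when the outcome balance differs between contexts, which is one way to see why the variation in code frequencies noted in the abstract can matter.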
Pages: 194-199
Page count: 6
Related papers
  • [1] Analysis of Image Thresholding Algorithms for Automated Machine Learning Training Data Generation
    Creek, Tristan
    Mullins, Barry E.
    PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON CYBER WARFARE AND SECURITY (ICCWS 2022), 2022, : 449 - 458
  • [2] Automated Shmoo Data Analysis: A Machine Learning Approach
    Wang, Wei
    PROCEEDINGS OF THE FIFTEENTH INTERNATIONAL SYMPOSIUM ON QUALITY ELECTRONIC DESIGN (ISQED 2014), 2015, : 212 - 218
  • [3] Exploring the Impact of Data Poisoning Attacks on Machine Learning Model Reliability
    Verde, Laura
    Marulli, Fiammetta
    Marrone, Stefano
    KNOWLEDGE-BASED AND INTELLIGENT INFORMATION & ENGINEERING SYSTEMS (KSE 2021), 2021, 192 : 2624 - 2632
  • [4] Machine learning approach for automated data analysis in tilted FBGs
    Leal-Junior, Arnaldo
    Avellar, Leandro
    Frizera, Anselmo
    Caucheteur, Christophe
    Marques, Carlos
    OPTICAL FIBER TECHNOLOGY, 2024, 84
  • [5] The impact of imbalanced training data on machine learning for author name disambiguation
    Jinseok Kim
    Jenna Kim
    Scientometrics, 2018, 117 : 511 - 526
  • [6] The impact of imbalanced training data on machine learning for author name disambiguation
    Kim, Jinseok
    Kim, Jenna
    SCIENTOMETRICS, 2018, 117 (01) : 511 - 526
  • [7] Automated CFRP impact damage detection with statistical thermographic data and machine learning
    Moskovchenko, Alexey
    Svantner, Michal
    INTERNATIONAL JOURNAL OF THERMAL SCIENCES, 2025, 208
  • [8] Assessing the Impact of Temporal Data Aggregation on the Reliability of Predictive Machine Learning Models
    Barhrhouj, Ayah
    Ananou, Bouchra
    Ouladsine, Mustapha
    INTELLIGENT DATA ENGINEERING AND AUTOMATED LEARNING - IDEAL 2024, PT I, 2025, 15346 : 481 - 492
  • [9] Reliability of probabilistic numerical data for training machine learning algorithms to detect damage in bridges
    Bud, Mihai Adrian
    Moldovan, Ionut
    Radu, Lucian
    Nedelcu, Mihai
    Figueiredo, Eloi
    STRUCTURAL CONTROL & HEALTH MONITORING, 2022, 29 (07):
  • [10] Automated analysis of high-content microscopy data with deep learning
    Kraus, Oren Z.
    Grys, Ben T.
    Ba, Jimmy
    Chong, Yolanda
    Frey, Brendan J.
    Boone, Charles
    Andrews, Brenda J.
    MOLECULAR SYSTEMS BIOLOGY, 2017, 13 (04)