Machine learning for automated content analysis: characteristics of training data impact reliability

Cited by: 5

Authors:

Fussell, Rebeckah [1]

Mazrui, Ali [1]

Holmes, N. G. [1]

Affiliations:

[1] Cornell Univ, Lab Atom & Solid State Phys, Ithaca, NY 14853 USA

Source:

2022 PHYSICS EDUCATION RESEARCH CONFERENCE (PERC) | 2022

Keywords:

DOI:

10.1119/perc.2022.pr.Fussell

Chinese Library Classification (CLC) number:

G40 [Education];

Discipline classification code:

040101 ; 120403 ;

Abstract:

Natural language processing (NLP) has the capacity to increase the scale and efficiency of content analysis in Physics Education Research. One promise of this approach is the possibility of implementing coding schemes on large data sets taken from diverse contexts. Applying NLP has two main challenges, however. First, a large initial human-coded data set is needed for training, though it is not immediately clear how much training data are needed. Second, if new data are taken from a different context from the training data, automated coding may be impacted in unpredictable ways. In this study, we investigate the conditions necessary to address these two challenges for a survey question that probes students' perspectives on the reliability of physics experimental results. We use neural networks in conjunction with Bag of Words embedding to perform automated coding of student responses for two binary codes, meaning each code is either present or absent in a response. We find that i) substantial agreement is consistently achieved for our data when the training set exceeds 600 responses, with 80-100 responses containing each code and ii) it is possible to perform automated coding using training data from a disparate context, but variation in code frequencies (outcome balances) across specific contexts can affect the reliability of coding. We offer suggestions for best practices in automated coding. Other smaller-scale investigations across a diverse range of coding scheme types and data contexts are needed to develop generalized principles.
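The two building blocks named in the abstract, a Bag of Words representation and an agreement check between human and automated codes, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the abstract does not name the agreement statistic, so Cohen's kappa is assumed here (on the common Landis and Koch reading, "substantial" corresponds to values of roughly 0.61 to 0.80), and the function names `bag_of_words` and `cohen_kappa` are hypothetical. The neural network classifier itself is omitted.

```python
from collections import Counter


def bag_of_words(responses):
    """Map each response to word counts over a shared, sorted vocabulary."""
    tokenized = [response.lower().split() for response in responses]
    vocab = sorted({word for tokens in tokenized for word in tokens})
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append([counts[word] for word in vocab])
    return vocab, vectors


def cohen_kappa(human, machine):
    """Cohen's kappa for two equal-length lists of binary codes (0 or 1).

    Assumption: kappa is the agreement statistic; the abstract does not say.
    """
    n = len(human)
    if n == 0 or n != len(machine):
        raise ValueError("need two non-empty lists of equal length")
    # Fraction of responses on which the two coders agree.
    observed = sum(h == m for h, m in zip(human, machine)) / n
    # Chance agreement from each coder's rate of assigning the code.
    p_human = sum(human) / n
    p_machine = sum(machine) / n
    expected = p_human * p_machine + (1 - p_human) * (1 - p_machine)
    if expected == 1.0:
        return float("nan")  # kappa is undefined when no variation exists
    return (observed - expected) / (1 - expected)
```

Because the chance-agreement term depends on how often each code appears, the same raw agreement rate can give different kappa values when code frequencies differ, which is one way the outcome balance mentioned in the abstract can affect measured reliability.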


Pages: 194-199

Page count: 6
