The Social Construction of Categorical Data: Mixed Methods Approach to Assessing Data Features in Publicly Available Datasets

Cited by: 1
Authors
Willem, Theresa [1, 2]
Wollek, Alessandro [3]
Cheslerean-Boghiu, Theodor [3]
Kenney, Martha [4]
Buyx, Alena [1]
Affiliations
[1] Tech Univ Munich, Inst Hist & Eth Med, Sch Med & Hlth, Ismaningerstr 22, D-81675 Munich, Germany
[2] Helmholtz Munich, Helmholtz AI, Munich, Germany
[3] Tech Univ Munich, Munich Inst Biomed Engn, Sch Computat Informat & Technol, Munich, Germany
[4] San Francisco State Univ, Women & Gender Studies, San Francisco, CA USA
Keywords
machine learning; categorical data; social context dependency; mixed methods; dermatology; dataset analysis; RACE; HEALTH; CENSUS
DOI
10.2196/59452
Chinese Library Classification
R-058
Abstract
Background: In data-sparse areas such as health care, computer scientists aim to leverage as much available information as possible to increase the accuracy of their machine learning models' outputs. As standard practice, categorical data, such as patients' gender, socioeconomic status, or skin color, are used to train models in fusion with other data types, such as medical images and text-based medical information. However, the effects of including categorical data features for model training in such data-scarce areas are underexamined, particularly regarding models intended to serve individuals equitably in a diverse population.
Objective: This study aimed to explore categorical data's effects on machine learning model outputs, root these effects in the data collection and dataset publication processes, and propose a mixed methods approach to examining datasets' data categories.
Methods: Against the theoretical background of the social construction of categories, we suggest a mixed methods approach to assess categorical data's utility for machine learning model training. As an example, we applied our approach to a Brazilian dermatological dataset (Dermatological and Surgical Assistance Program at the Federal University of Espirito Santo [PAD-UFES] 20). We first present an exploratory, quantitative study that assesses the effects of including or excluding each of the unique categorical data features of the PAD-UFES 20 dataset when training a transformer-based model using a data fusion algorithm. We then pair our quantitative analysis with a qualitative examination of the data categories based on interviews with the dataset authors.
Results: Our quantitative study suggests scattered effects of including categorical data for machine learning model training across predictive classes. Our qualitative analysis gives insights into how the categorical data were collected and why they were published, explaining some of the quantitative effects that we observed. Our findings highlight the social constructedness of categorical data in publicly available datasets, meaning that the data in a category depend heavily on both how the dataset creators define these categories and the sociomedical context in which the data are collected. This reveals relevant limitations of using publicly available datasets in contexts different from those in which their data were collected.
Conclusions: We caution against using data features of publicly available datasets without reflecting on the social construction and context dependency of their categorical data features, particularly in data-sparse areas. We conclude that social scientific, context-dependent analysis of available data features using both quantitative and qualitative methods is helpful in judging the utility of categorical data for the population for which a model is intended.
Pages: 15