Exploring the Utility of Anonymized EHR Datasets in Machine Learning Experiments in the Context of the MODELHealth Project

Cited by: 0
Authors
Pitoglou, Stavros [1, 2]
Filntisi, Arianna [2 ]
Anastasiou, Athanasios [1 ]
Matsopoulos, George K. [1 ]
Koutsouris, Dimitrios [1 ]
Affiliations
[1] Natl Tech Univ Athens, Sch Elect & Comp Engn, Athens 15780, Greece
[2] Comp Solut SA, Athens 11527, Greece
Source
APPLIED SCIENCES-BASEL | 2022 / Vol. 12 / Issue 12
Keywords
machine learning; anonymization; Mondrian; HEALTH-CARE; BIG DATA; PRIVACY; ALGORITHMS; SECURITY; THREATS
DOI
10.3390/app12125942
Chinese Library Classification (CLC) number
O6 [Chemistry]
Subject classification code
0703
Abstract
This paper applies machine learning to a clinical dataset anonymized with the Mondrian algorithm. (1) Background: Preserving patient privacy is a necessity arising from the increasing digitization of health data; however, the effect of data anonymization on the performance of machine learning models remains to be explored. (2) Methods: The original EHR-derived dataset was anonymized by applying the Mondrian algorithm for various k values and quasi-identifier (QI) attribute sets. Logistic regression, decision tree, k-nearest neighbors (KNN), Gaussian naive Bayes, and support vector machine models were applied to the resulting dataset versions. (3) Results: The classifiers showed different degrees of resilience to anonymization: the decision tree and KNN models performed remarkably stably, whereas the Gaussian naive Bayes model did not. The choice of attributes in the QI set and the generalized information loss value played a more important role than the size of the QI set or the k value. (4) Conclusions: Data anonymization can reduce the performance of certain machine learning models, although an appropriate selection of classifier and parameter values can mitigate this effect.
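The Mondrian step named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes numeric quasi-identifiers only, uses a strict median split on the attribute with the widest range, and the function names `mondrian` and `generalize` as well as the sample records are hypothetical.

```python
from statistics import median


def mondrian(records, qi_indices, k):
    """Split records into groups of at least k rows (strict Mondrian sketch).

    records: list of tuples; qi_indices: positions of numeric quasi-identifiers.
    """
    def span(part, i):
        values = [r[i] for r in part]
        return max(values) - min(values)

    def split(part):
        # Try quasi-identifiers in order of decreasing value range.
        for i in sorted(qi_indices, key=lambda j: span(part, j), reverse=True):
            m = median(r[i] for r in part)
            left = [r for r in part if r[i] <= m]
            right = [r for r in part if r[i] > m]
            # Accept the cut only if both halves still satisfy k-anonymity.
            if len(left) >= k and len(right) >= k:
                return split(left) + split(right)
        return [part]  # no allowable cut: this group is final

    return split(list(records))


def generalize(partitions, qi_indices):
    """Replace each quasi-identifier value with its group's (min, max) range."""
    out = []
    for part in partitions:
        ranges = {i: (min(r[i] for r in part), max(r[i] for r in part))
                  for i in qi_indices}
        for r in part:
            out.append(tuple(ranges.get(i, v) for i, v in enumerate(r)))
    return out


# Hypothetical toy data: (age, zip code, label); age and zip code are the QIs.
records = [(25, 10001, "A"), (27, 10002, "B"), (31, 10010, "A"),
           (45, 20001, "B"), (47, 20005, "A"), (52, 20009, "B")]
parts = mondrian(records, qi_indices=[0, 1], k=2)
rows = generalize(parts, qi_indices=[0, 1])
```

Larger k values yield fewer, wider groups, so the generalized ranges carry less information for a downstream classifier; this is the trade-off the abstract evaluates.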
Pages: 20
Related papers
37 in total
  • [1] Measuring the impact of anonymization on real-world consolidated health datasets engineered for secondary research use: Experiments in the context of MODELHealth project
    Pitoglou, Stavros
    Filntisi, Arianna
    Anastasiou, Athanasios
    Matsopoulos, George K.
    Koutsouris, Dimitrios
    [J]. FRONTIERS IN DIGITAL HEALTH, 2022, 4
  • [2] APPLICATION OF MACHINE LEARNING TO LIMITED DATASETS: PREDICTION OF PROJECT SUCCESS
    Bang, Sofie
    Aarvold, Magnus O.
    Hartvig, Wilhelm J.
    Olsson, Nils O. E.
    Rauzy, Antoine
    [J]. JOURNAL OF INFORMATION TECHNOLOGY IN CONSTRUCTION, 2022, 27 : 732 - 755
  • [3] Exploring new useful phosphors by combining experiments with machine learning
    Takeda, Takashi
    Koyama, Yukinori
    Ikeno, Hidekazu
    Matsuishi, Satoru
    Hirosaki, Naoto
    [J]. Science and Technology of Advanced Materials, 2024, 25 (01)
  • [4] Exploring the Usefulness of Machine Learning in the Context of WebRTC Performance Estimation
    Ammar, Doreid
    De Moor, Katrien
    Skorin-Kapov, Lea
    Fiedler, Markus
    Heegaard, Poul E.
    [J]. PROCEEDINGS OF THE IEEE LCN: 2019 44TH ANNUAL IEEE CONFERENCE ON LOCAL COMPUTER NETWORKS (LCN 2019), 2019, : 406 - 413
  • [5] Learning machine learning with young children: exploring informal settings in an African context
    Sanusi, Ismaila Temitayo
    Sunday, Kissinger
    Oyelere, Solomon Sunday
    Suhonen, Jarkko
    Vartiainen, Henriikka
    Tukiainen, Markku
    [J]. COMPUTER SCIENCE EDUCATION, 2024, 34 (02) : 161 - 192
  • [6] Exploring Machine Learning in Chemistry through the Classification of Spectra: An Undergraduate Project
    St James, Alanah Grant
    Hand, Luke
    Mills, Thomas
    Song, Liwen
    Brunt, Annabel S. J.
    Mann, Patrick E. Bergstrom
    Worrall, Andrew F.
    Stewart, Malcolm I.
    Vallance, Claire
    [J]. JOURNAL OF CHEMICAL EDUCATION, 2023, 100 (03) : 1343 - 1350
  • [7] Novel Machine Learning Experiments with Artificially Generated Big Data from Small Immunotherapy Datasets
    Mahmoud, Ahsanullah Yunas
    Neagu, Daniel
    Scrimieri, Daniele
    Abdullatif, Amr Rashad Ahmed
    [J]. 2022 21ST IEEE INTERNATIONAL CONFERENCE ON MACHINE LEARNING AND APPLICATIONS, ICMLA, 2022, : 986 - 991
  • [8] Exploring online public survey lifestyle datasets with statistical analysis, machine learning and semantic ontology
    Ayan Chatterjee
    Michael A. Riegler
    Miriam Sinkerud Johnson
    Jishnu Das
    Nibedita Pahari
    Raghavendra Ramachandra
    Bikramaditya Ghosh
    Arpan Saha
    Ram Bajpai
    [J]. Scientific Reports, 14 (1)
  • [9] Exploring Sensitivity of ICF Outputs to Design Parameters in Experiments Using Machine Learning
    Nakhleh, Julia B.
    Fernandez-Godino, M. Giselle
    Grosskopf, Michael J.
    Wilson, Brandon M.
    Kline, John
    Srinivasan, Gowri
    [J]. IEEE TRANSACTIONS ON PLASMA SCIENCE, 2021, 49 (07) : 2238 - 2246
  • [10] Multiattribute Based Machine Learning Models for Severity Prediction in Cross Project Context
    Sharma, Meera
    Kumari, Madhu
    Singh, R. K.
    Singh, V. B.
    [J]. COMPUTATIONAL SCIENCE AND ITS APPLICATIONS - ICCSA 2014, PT V, 2014, 8583 : 227 - +