Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data

Cited by: 4
Authors
Valdez-Valenzuela, Eric [1]
Kuri-Morales, Angel [2]
Gomez-Adorno, Helena [3]
Affiliations
[1] Univ Nacl Autonoma Mexico, Ciencia & Ingn Comp, Ciudad De Mexico, Mexico
[2] Inst Tecnol Autonomo Mexico, Ciudad De Mexico, Mexico
[3] Univ Nacl Autonoma Mexico, Inst Invest Matemat Aplicadas & Sistemas, Ciudad De Mexico, Mexico
Keywords
Supervised machine learning; Data preprocessing; Categorical encoding; Synthetic data
DOI
10.1007/978-3-030-89817-5_7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most of the datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded before they can be used in supervised learning algorithms. Although there are several encoding techniques, the most commonly used ones do not necessarily preserve patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, a comparative study is presented to measure how different encoding techniques affect the performance of machine learning models. We test 10 encoding methods, using 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach based on synthetically created datasets, which allows us to know a priori the relationship between the independent and the dependent variables and therefore to measure the impact of the encoding techniques more precisely. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.
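The idea of measuring an encoder against a relationship that is known a priori can be illustrated with a minimal sketch. This is not the authors' generator or experimental setup: the six-category attribute, the effect values in `CATEGORY_EFFECTS`, the function names, and the use of a plain least-squares linear model (standing in for the 5 ML algorithms mentioned in the abstract) are all assumptions made for illustration. Only two of the many possible encoders are compared.

```python
import numpy as np

# Assumed ground truth: the effect of each of six categories on the target.
# The values are deliberately not monotonic in the category index, so an
# encoder that imposes an arbitrary order cannot represent them linearly.
CATEGORY_EFFECTS = np.array([3.0, 0.0, 5.0, 1.0, 4.0, 2.0])


def make_synthetic(n_rows=2000, noise=0.1, seed=0):
    """Draw category codes and a target that depends only on the category."""
    rng = np.random.default_rng(seed)
    codes = rng.integers(0, len(CATEGORY_EFFECTS), size=n_rows)
    y = CATEGORY_EFFECTS[codes] + rng.normal(0.0, noise, size=n_rows)
    return codes, y


def ordinal_encode(codes):
    """Map each category to its integer index (a single numeric column)."""
    return codes.astype(float).reshape(-1, 1)


def one_hot_encode(codes, n_categories):
    """Map each category to an indicator vector (one column per category)."""
    return np.eye(n_categories)[codes]


def fit_and_score(X_train, y_train, X_test, y_test):
    """Fit a linear model by least squares and return R^2 on held-out rows."""
    def add_bias(X):
        return np.column_stack([np.ones(len(X)), X])

    coef, *_ = np.linalg.lstsq(add_bias(X_train), y_train, rcond=None)
    residual = y_test - add_bias(X_test) @ coef
    ss_res = np.sum(residual ** 2)
    ss_tot = np.sum((y_test - y_test.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def compare_encoders():
    """Score the same model on the same data under two encodings."""
    codes, y = make_synthetic()
    split = len(y) // 2  # first half for fitting, second half for scoring
    encodings = {
        "ordinal": ordinal_encode(codes),
        "one_hot": one_hot_encode(codes, len(CATEGORY_EFFECTS)),
    }
    return {
        name: fit_and_score(X[:split], y[:split], X[split:], y[split:])
        for name, X in encodings.items()
    }


scores = compare_encoders()
print(scores)
```

Because the target is generated from a known per-category effect, any gap between the two scores can be attributed to the encoding rather than to unknown structure in the data. In this sketch the linear model recovers the relationship under one-hot encoding but not under the arbitrary integer order, which is one way an inappropriately applied encoder can lose information.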
Pages: 92-107
Page count: 16