Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data

Cited by: 4
Authors
Valdez-Valenzuela, Eric [1]
Kuri-Morales, Angel [2]
Gomez-Adorno, Helena [3]
Affiliations
[1] Univ Nacl Autonoma Mexico, Ciencia & Ingn Comp, Ciudad De Mexico, Mexico
[2] Inst Tecnol Autonomo Mexico, Ciudad De Mexico, Mexico
[3] Univ Nacl Autonoma Mexico, Inst Invest Matemat Aplicadas & Sistemas, Ciudad De Mexico, Mexico
Keywords
Supervised machine learning; Data preprocessing; Categorical encoding; Synthetic data
DOI
10.1007/978-3-030-89817-5_7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most of the datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded for their use in supervised learning algorithms. Although there are several encoding techniques, the most commonly used ones do not necessarily preserve possible patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, a comparative study is presented to measure how the different encoding techniques affect the performance of machine learning models. We test 10 encoding methods, using 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach that uses synthetically created datasets, which allow us to know a priori the relationship between the independent and the dependent variables and therefore to measure the impact of the encoding techniques more precisely. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.
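The idea of a synthetic dataset whose input-output relationship is known a priori can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not name the 10 encoders or the 5 algorithms, so the two encoders (ordinal and one-hot), the least-squares model, and all names below (`make_synthetic`, `fit_ordinal`, `fit_one_hot`, `effects`) are assumptions chosen for illustration. Each category is given a fixed effect on the target, and the in-sample R^2 of a linear fit is compared under the two encodings.

```python
import random
from statistics import mean


def make_synthetic(n_rows, effects, noise_sd, seed):
    """Draw a categorical feature and a target whose dependence on it is known.

    effects[c] is the (assumed, hypothetical) true contribution of category c.
    """
    rng = random.Random(seed)
    cats = [rng.randrange(len(effects)) for _ in range(n_rows)]
    y = [effects[c] + rng.gauss(0.0, noise_sd) for c in cats]
    return cats, y


def r_squared(y, y_hat):
    """Coefficient of determination of predictions y_hat against targets y."""
    y_bar = mean(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, y_hat))
    ss_tot = sum((a - y_bar) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot


def fit_ordinal(cats, y):
    """Ordinal encoding: least-squares line on the integer category code."""
    x_bar, y_bar = mean(cats), mean(y)
    sxx = sum((x - x_bar) ** 2 for x in cats)
    sxy = sum((x - x_bar) * (t - y_bar) for x, t in zip(cats, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return [intercept + slope * x for x in cats]


def fit_one_hot(cats, y):
    """One-hot encoding: least squares on indicator columns (no intercept)
    reduces to predicting the mean target of each category."""
    groups = {}
    for c, t in zip(cats, y):
        groups.setdefault(c, []).append(t)
    means = {c: mean(values) for c, values in groups.items()}
    return [means[c] for c in cats]


# Known ground truth: category effects that are NOT monotone in the code,
# so an ordinal code imposes an order the data does not have.
effects = [5.0, -3.0, 8.0, 0.0, -6.0]
cats, y = make_synthetic(2000, effects, noise_sd=0.5, seed=0)

r2_ordinal = r_squared(y, fit_ordinal(cats, y))
r2_one_hot = r_squared(y, fit_one_hot(cats, y))
print(f"ordinal R^2: {r2_ordinal:.3f}")
print(f"one-hot R^2: {r2_one_hot:.3f}")
```

Because the category effects are fixed in advance, any gap between the two scores can be attributed to the encoding rather than to unknown structure in the data. A fuller experiment would evaluate on held-out rows and cover more encoders and model families.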
Pages: 92-107
Number of pages: 16
Related papers
50 in total
  • [31] Measuring Corporate Culture Using Machine Learning
    Li, Kai
    Mai, Feng
    Shen, Rui
    Yan, Xinyan
    REVIEW OF FINANCIAL STUDIES, 2021, 34 (07): 3265-3315
  • [32] Development of a measuring method for motion accuracy of NC machine tools using links and rotary encoders
    Iwai, Hiroaki
    Mitsui, Kimiyuki
    INTERNATIONAL JOURNAL OF MACHINE TOOLS & MANUFACTURE, 2009, 49 (01): 99-108
  • [33] Measuring digitalization capabilities using machine learning
    Yang, Jinglan
    Liu, Jianghuai
    Yao, Zheng
    Ma, Chaoqun
    RESEARCH IN INTERNATIONAL BUSINESS AND FINANCE, 2024, 70
  • [34] Fundamental Components and Principles of Supervised Machine Learning Workflows with Numerical and Categorical Data
    Kampezidou, Styliani I.
    Ray, Archana Tikayat
    Bhat, Anirudh Prabhakara
    Fischer, Olivia J. Pinon
    Mavris, Dimitri N.
    ENG, 2024, 5 (01): 384-416
  • [35] Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data
    Anjun Chen
    Drake O. Chen
    Scientific Reports, 12
  • [36] Simulation of a machine learning enabled learning health system for risk prediction using synthetic patient data
    Chen, Anjun
    Chen, Drake O.
    SCIENTIFIC REPORTS, 2022, 12 (01)
  • [37] A Machine Learning Approach for Automated Filling of Categorical Fields in Data Entry Forms
    Belgacem, Hichem
    Li, Xiaochen
    Bianculli, Domenico
    Briand, Lionel
    ACM TRANSACTIONS ON SOFTWARE ENGINEERING AND METHODOLOGY, 2023, 32 (02)
  • [38] Machine learning based predictive action on categorical non-sequential data
    Pradeep S.
    Kallimani J.S.
    Recent Advances in Computer Science and Communications, 2020, 13 (05): 1020-1030
  • [39] Early Prediction of Neonatal Sepsis From Synthetic Clinical Data Using Machine Learning
    Lyra, Simon
    Jin, Jinyi
    Leonhardt, Steffen
    Lueken, Markus
    2023 45TH ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE & BIOLOGY SOCIETY, EMBC, 2023
  • [40] Comparison of Machine Learning Approaches for Reconstructing Sea Subsurface Salinity Using Synthetic Data
    Tian, Tian
    Leng, Hongze
    Wang, Gongjie
    Li, Guancheng
    Song, Junqiang
    Zhu, Jiang
    An, Yuzhu
    REMOTE SENSING, 2022, 14 (22)