Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data

Cited by: 4
Authors
Valdez-Valenzuela, Eric [1]
Kuri-Morales, Angel [2]
Gomez-Adorno, Helena [3]
Affiliations
[1] Univ Nacl Autonoma Mexico, Ciencia & Ingn Comp, Ciudad De Mexico, Mexico
[2] Inst Tecnol Autonomo Mexico, Ciudad De Mexico, Mexico
[3] Univ Nacl Autonoma Mexico, Inst Invest Matemat Aplicadas & Sistemas, Ciudad De Mexico, Mexico
Keywords
Supervised machine learning; Data preprocessing; Categorical encoding; Synthetic data
DOI
10.1007/978-3-030-89817-5_7
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded before they can be used in supervised learning algorithms. Although several encoding techniques exist, the most commonly used ones do not necessarily preserve patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, a comparative study is presented to measure how different encoding techniques affect the performance of ML models. We test 10 encoding methods, using 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach based on synthetically created datasets, which allow us to know a priori the relationship between the independent and the dependent variables and therefore to measure the impact of the encoding techniques more precisely. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.
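The idea in the abstract of a synthetic dataset whose input-output relationship is known a priori can be illustrated with a minimal sketch. This is not the authors' experimental setup. The category effects, the two encoders (ordinal and one-hot), the least-squares model, and all function names are assumptions chosen for illustration, and the score is an in-sample R² rather than a held-out metric.

```python
import numpy as np


def make_synthetic(n_rows=2000, seed=0):
    """Build a toy dataset where the effect of each category is fixed in advance."""
    rng = np.random.default_rng(seed)
    # Hypothetical, non-monotonic category effects (known a priori by construction).
    effects = np.array([3.0, -2.0, 5.0, 0.0])
    cats = rng.integers(0, len(effects), size=n_rows)
    x_num = rng.normal(size=n_rows)
    y = effects[cats] + 1.5 * x_num + rng.normal(scale=0.1, size=n_rows)
    return cats, x_num, y


def ordinal_encode(cats):
    """Map each category to its integer code (imposes an artificial order)."""
    return cats.astype(float).reshape(-1, 1)


def one_hot_encode(cats, n_categories):
    """Map each category to an indicator vector."""
    return np.eye(n_categories)[cats]


def r2_linear(features, y):
    """Fit ordinary least squares with an intercept and return the in-sample R^2."""
    X = np.column_stack([features, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ coef
    return 1.0 - resid.var() / y.var()


cats, x_num, y = make_synthetic()
r2_ordinal = r2_linear(np.column_stack([ordinal_encode(cats), x_num]), y)
r2_one_hot = r2_linear(np.column_stack([one_hot_encode(cats, 4), x_num]), y)
print(f"ordinal R^2: {r2_ordinal:.3f}")
print(f"one-hot R^2: {r2_one_hot:.3f}")
```

Because the category effects are set by hand, any gap between the two scores can be attributed to the encoding rather than to unknown structure in the data. In this sketch a linear model cannot recover non-monotonic effects from integer codes, while the one-hot features let it fit one coefficient per category.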
Pages: 92-107
Number of pages: 16
Related papers
50 in total
  • [21] Machine learning integrated credibilistic semi supervised clustering for categorical data
    Sarkar, Jnanendra Prasad
    Saha, Indrajit
    Chakraborty, Sinjan
    Maulik, Ujjwal
    APPLIED SOFT COMPUTING, 2020, 86
  • [22] Imperfection Sensitivity Detection in Pultruded Columns Using Machine Learning and Synthetic Data
    Tzimas, Michail
    Barbero, Ever J.
    BUILDINGS, 2024, 14 (04)
  • [23] Emergency Shutdown Valve damage classification by machine learning using synthetic data
    de Gouveia, S. M.
    Correa, L. de Abreu
    Teles, D. B.
    Oliveira, M.
    Clarke, T. G. R.
    ENGINEERING FAILURE ANALYSIS, 2024, 156
  • [24] Machine Vision for Collaborative Robotics Using Synthetic Data-Driven Learning
    Camilo Martinez-Franco, Juan
    Alvarez-Martinez, David
    SERVICE ORIENTED, HOLONIC AND MULTI-AGENT MANUFACTURING SYSTEMS FOR INDUSTRY OF THE FUTURE, SOHOMA LATIN AMERICA 2021, 2021, 987 : 69 - 81
  • [25] Machine Learning Approaches for Prediction of Facial Rejuvenation Using Real and Synthetic Data
    Shah, Syed Afaq Ali
    Bennamoun, Mohammed
    Molton, Michael K.
    IEEE ACCESS, 2019, 7 : 23779 - 23787
  • [26] Study of the Learning Algorithm for Multivariable Data Analysis in Machine Learning Tasks under Missing Data
    Aguilar, Jose
    Pinto, Angel
    Puerto, Eduard
    Rivero, Yair
    2024 L LATIN AMERICAN COMPUTER CONFERENCE, CLEI 2024, 2024
  • [27] Facilitating and Managing Machine Learning and Data Analysis Tasks in Big Data Environments using Web and Microservice Technologies
    Shahoud, Shadi
    Gunnarsdottir, Sonja
    Khalloof, Hatem
    Duepmeier, Clemens
    Hagenmeyer, Veit
    11TH INTERNATIONAL CONFERENCE ON MANAGEMENT OF DIGITAL ECOSYSTEMS (MEDES), 2019: 80 - 87
  • [28] AUTOMATED MACHINE LEARNING & SYNTHETIC DATA APPLICATIONS IN MEDICINE
    Rashidi, Hooman
    INTERNATIONAL JOURNAL OF LABORATORY HEMATOLOGY, 2023, 45 : 93 - 93
  • [29] Synthetic data enable experiments in atomistic machine learning
    Gardner, John L. A.
    Beaulieu, Zoe Faure
    Deringer, Volker L.
    DIGITAL DISCOVERY, 2023, 2 (03): 651 - 662
  • [30] Synthetic data as an enabler for machine learning applications in medicine
    Rajotte, Jean-Francois
    Bergen, Robert
    Buckeridge, David L.
    El Emam, Khaled
    Ng, Raymond
    Strome, Elissa
    ISCIENCE, 2022, 25 (11)