Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data

Cited by: 4
Authors
Valdez-Valenzuela, Eric [1]
Kuri-Morales, Angel [2]
Gomez-Adorno, Helena [3]
Affiliations
[1] Univ Nacl Autonoma Mexico, Ciencia & Ingn Comp, Ciudad De Mexico, Mexico
[2] Inst Tecnol Autonomo Mexico, Ciudad De Mexico, Mexico
[3] Univ Nacl Autonoma Mexico, Inst Invest Matemat Aplicadas & Sistemas, Ciudad De Mexico, Mexico
Keywords
Supervised machine learning; Data preprocessing; Categorical encoding; Synthetic data
DOI
10.1007/978-3-030-89817-5_7
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Most of the datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded before supervised learning algorithms can use them. Although there are several encoding techniques, the most commonly used ones do not necessarily preserve possible patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, a comparative study is presented to measure how the different encoding techniques affect the performance of machine learning models. We test 10 encoding methods, using 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach based on synthetically created datasets, for which the relationship between the independent and the dependent variables is known a priori, which implies a more precise measurement of the encoding techniques' impact. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.
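The idea of measuring an encoder's impact on synthetic data with a known ground truth can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the record does not name the 10 encoders or 5 algorithms, so the sketch uses two common encoders (ordinal and one-hot), a plain least-squares linear model, and a hypothetical three-category effect table chosen so that the effect is not monotone in the alphabetical category order.

```python
import random
from statistics import mean


def make_synthetic(n=600, seed=0):
    """Draw a categorical feature and a target whose dependence on it is known."""
    rng = random.Random(seed)
    # Hypothetical ground truth, chosen by us: non-monotone in alphabetical order.
    effect = {"a": 0.0, "b": 5.0, "c": 1.0}
    xs = [rng.choice(sorted(effect)) for _ in range(n)]
    ys = [effect[x] + rng.gauss(0.0, 0.1) for x in xs]
    return xs, ys


def ordinal_encode(xs):
    """Map each category to its index in sorted order (one numeric column)."""
    codes = {c: i for i, c in enumerate(sorted(set(xs)))}
    return [float(codes[x]) for x in xs]


def one_hot_encode(xs):
    """Map each category to an indicator vector (one column per category)."""
    cats = sorted(set(xs))
    return [[1.0 if x == c else 0.0 for c in cats] for x in xs]


def mse_linear_1d(z, ys):
    """Training MSE of ordinary least squares y = a + b*z on a single column."""
    zm, ym = mean(z), mean(ys)
    var = sum((v - zm) ** 2 for v in z)
    b = sum((v - zm) * (y - ym) for v, y in zip(z, ys)) / var
    a = ym - b * zm
    return mean((y - (a + b * v)) ** 2 for v, y in zip(z, ys))


def mse_linear_one_hot(rows, ys):
    """Training MSE of least squares on one-hot columns without an intercept.

    The columns are disjoint indicators, so the least-squares weight of each
    column is simply the mean target of that category.
    """
    k = len(rows[0])
    sums, counts = [0.0] * k, [0] * k
    for row, y in zip(rows, ys):
        j = row.index(1.0)
        sums[j] += y
        counts[j] += 1
    weights = [s / c for s, c in zip(sums, counts)]
    return mean((y - weights[row.index(1.0)]) ** 2 for row, y in zip(rows, ys))


xs, ys = make_synthetic()
mse_ordinal = mse_linear_1d(ordinal_encode(xs), ys)
mse_one_hot = mse_linear_one_hot(one_hot_encode(xs), ys)
print(f"ordinal MSE: {mse_ordinal:.3f}")
print(f"one-hot MSE: {mse_one_hot:.3f}")
```

Because the effect table is fixed in advance, any gap between the two errors can be attributed to the encoding rather than to unknown structure in the data. In this toy setup the linear model cannot recover the non-monotone effect from ordinal codes, while the one-hot fit approaches the noise level; a different model family or a different category order could change that picture.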
Pages: 92-107
Number of pages: 16