Measuring the Effect of Categorical Encoders in Machine Learning Tasks Using Synthetic Data

Cited by: 4

Authors:

Valdez-Valenzuela, Eric [1]

Kuri-Morales, Angel [2]

Gomez-Adorno, Helena [3]

Affiliations:

[1] Univ Nacl Autonoma Mexico, Ciencia & Ingn Comp, Ciudad De Mexico, Mexico

[2] Inst Tecnol Autonomo Mexico, Ciudad De Mexico, Mexico

[3] Univ Nacl Autonoma Mexico, Inst Invest Matemat Aplicadas & Sistemas, Ciudad De Mexico, Mexico

Source:

ADVANCES IN COMPUTATIONAL INTELLIGENCE (MICAI 2021), PT I | 2021 / Vol. 13067

Keywords:

Supervised machine learning; Data preprocessing; Categorical encoding; Synthetic data

DOI:

10.1007/978-3-030-89817-5_7

Chinese Library Classification (CLC) number:

TP18 [Artificial intelligence theory]

Discipline classification codes:

081104; 0812; 0835; 1405

Abstract:

Most of the datasets used in Machine Learning (ML) tasks contain categorical attributes. In practice, these attributes must be numerically encoded for use in supervised learning algorithms. Although there are several encoding techniques, the most commonly used ones do not necessarily preserve possible patterns embedded in the data when they are applied inappropriately. This potential loss of information affects the performance of ML algorithms in automated learning tasks. In this paper, a comparative study is presented to measure how the different encoding techniques affect the performance of machine learning models. We test 10 encoding methods, using 5 ML algorithms on real and synthetic data. Furthermore, we propose a novel approach that uses synthetically created datasets, which allow us to know a priori the relationship between the independent and the dependent variables and therefore to measure the impact of the encoding techniques more precisely. We show that some ML models are affected negatively or positively depending on the encoding technique used. We also show that the proposed approach is more easily controlled and faster when performing experiments on categorical encoders.
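As an illustration only, the idea in the abstract of generating data whose input-output relationship is known in advance, and then comparing encoders against it, can be sketched in a few lines of Python. This is not the authors' experimental setup: the category effects, the noise level, the two encoders shown (ordinal and one-hot), the single linear model, and all function names are assumptions chosen to keep the sketch short. The error reported is in-sample mean squared error.

```python
import random
from statistics import mean

# Hypothetical ground truth: each category has a known effect on the target.
# The effects are deliberately non-monotonic in alphabetical order.
EFFECTS = {"a": 0.0, "b": 5.0, "c": 1.0, "d": 6.0}


def make_synthetic(n, effects, noise, seed):
    """Draw n rows of (category, target), target = known effect + Gaussian noise."""
    rng = random.Random(seed)
    cats = sorted(effects)
    rows = []
    for _ in range(n):
        c = rng.choice(cats)
        rows.append((c, effects[c] + rng.gauss(0.0, noise)))
    return rows


def mse_ordinal_linear(rows):
    """Ordinal-encode by alphabetical rank, then fit y = a + b*x by least squares."""
    codes = {c: i for i, c in enumerate(sorted({c for c, _ in rows}))}
    xs = [codes[c] for c, _ in rows]
    ys = [y for _, y in rows]
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    return mean((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def mse_onehot_linear(rows):
    """One-hot encoding with a linear least-squares fit (no intercept).

    For one-hot inputs the least-squares prediction for a row is simply
    the mean target of that row's category, so no matrix solve is needed.
    """
    groups = {}
    for c, y in rows:
        groups.setdefault(c, []).append(y)
    means = {c: mean(v) for c, v in groups.items()}
    return mean((y - means[c]) ** 2 for c, y in rows)


rows = make_synthetic(2000, EFFECTS, noise=0.5, seed=0)
ordinal_mse = mse_ordinal_linear(rows)
onehot_mse = mse_onehot_linear(rows)
print(f"ordinal + linear: MSE = {ordinal_mse:.3f}")
print(f"one-hot + linear: MSE = {onehot_mse:.3f}")
```

Because the true effects are known, the gap between the two errors can be attributed to the encoding rather than to unknown structure in the data: the one-hot fit should approach the noise variance, while the ordinal fit is limited by the arbitrary ordering it imposes on the categories.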

Pages: 92-107

Page count: 16
