Reimagining Synthetic Tabular Data Generation through Data-Centric AI: A Comprehensive Benchmark

Authors
Hansen, Lasse [1, 2]
Seedat, Nabeel [3 ]
van der Schaar, Mihaela [3 ]
Petrovic, Andrija [4 ]
Affiliations
[1] Aarhus Univ Hosp Psychiat, Dept Affect Disorders, Aarhus, Denmark
[2] Aarhus Univ, Dept Clin Med, Aarhus, Denmark
[3] Univ Cambridge, Dept Appl Math & Theoret Phys, Cambridge, England
[4] Univ Belgrade, Fac Organisat Sci, Belgrade, Serbia
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Synthetic data serves as an alternative for training machine learning models, particularly when real-world data is limited or inaccessible. However, ensuring that synthetic data mirrors the complex nuances of real-world data is challenging. This paper addresses this issue by exploring the potential of integrating data-centric AI techniques that profile the data to guide the synthetic data generation process. Moreover, we shed light on the often-ignored consequences of neglecting these data profiles during synthetic data generation, which arise despite seemingly high statistical fidelity. Subsequently, we propose a novel framework to evaluate the integration of data profiles to guide the creation of more representative synthetic data. In an empirical study, we evaluate the performance of five state-of-the-art models for tabular data generation on eleven distinct tabular datasets. The findings offer critical insights into the successes and limitations of current synthetic data generation techniques. Finally, we provide practical recommendations for integrating data-centric insights into the synthetic data generation process, with a specific focus on classification performance, model selection, and feature selection. This study aims to reevaluate conventional approaches to synthetic data generation and promote the application of data-centric AI techniques in improving the quality and effectiveness of synthetic data.
Pages: 43