Conditional Synthetic Data Generation for Robust Machine Learning Applications with Limited Pandemic Data

Cited by: 0
Authors
Das, Hari Prasanna [1]
Tran, Ryan [1]
Singh, Japjot [1]
Yue, Xiangyu [1]
Tison, Geoffrey [2]
Sangiovanni-Vincentelli, Alberto [1]
Spanos, Costas J. [1]
Affiliations
[1] Univ Calif Berkeley, Dept Elect Engn & Comp Sci, Berkeley, CA 94720 USA
[2] Univ Calif San Francisco UCSF, Div Cardiol, San Francisco, CA USA
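Funding
National Research Foundation, Singapore
Keywords
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract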
Background: At the onset of a pandemic such as COVID-19, data with proper labels/attributes for the new disease may be unavailable or sparse. Machine learning (ML) models trained on the available data, which is limited in quantity and poor in diversity, will often be biased and inaccurate. At the same time, ML algorithms designed to fight pandemics must perform well and be developed in a time-sensitive manner. To address both limited data and label scarcity in the available data, we propose generating conditional synthetic data to be used alongside real data for developing robust ML models.
Methods: We present a hybrid model consisting of a conditional generative flow and a classifier for conditional synthetic data generation. The classifier decouples the feature representation for the condition, which is fed to the flow to extract the local noise. We generate synthetic data by manipulating the local noise while keeping the conditional feature representation fixed. We also propose a semi-supervised approach to generate synthetic samples when labels are absent for a majority of the available data.
Results: We performed conditional synthetic generation for chest computed tomography (CT) scans corresponding to normal, COVID-19, and pneumonia-afflicted patients. We show that our method significantly outperforms existing models on both qualitative and quantitative performance, and that our semi-supervised approach can efficiently synthesize conditional samples under label scarcity. As an example of downstream use of synthetic data, we show improved COVID-19 detection from CT scans with conditional synthetic data augmentation.
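The generation step described under Methods (extract the local noise with the flow, perturb it, and invert the flow with the condition feature held fixed) can be illustrated with a toy sketch. Everything below is an assumption made for illustration, not the authors' model: the flow is reduced to a single conditional affine layer, the classifier's feature representation is replaced by a fixed random vector per class (`COND_FEATURES`), and the names `encode`, `decode`, `synthesize`, `W_SCALE`, and `W_SHIFT` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for the classifier's conditional features: one row per class.
NUM_CLASSES, DIM = 3, 4
COND_FEATURES = rng.normal(size=(NUM_CLASSES, DIM))

# Hypothetical flow parameters: linear maps from the condition feature to a
# per-dimension log-scale and shift (one conditional affine layer, untrained).
W_SCALE = 0.1 * rng.normal(size=(DIM, DIM))
W_SHIFT = rng.normal(size=(DIM, DIM))


def encode(x, cond):
    """Forward pass: map a sample to its local noise given the condition feature."""
    log_scale = cond @ W_SCALE
    shift = cond @ W_SHIFT
    return (x - shift) * np.exp(-log_scale)


def decode(z, cond):
    """Inverse pass: map local noise back to data space for the same condition."""
    log_scale = cond @ W_SCALE
    shift = cond @ W_SHIFT
    return z * np.exp(log_scale) + shift


def synthesize(x, label, n_samples, noise_std=0.1, seed=0):
    """Perturb the local noise of x while holding the condition feature fixed."""
    local_rng = np.random.default_rng(seed)
    cond = COND_FEATURES[label]
    z = encode(x, cond)
    z_new = z + noise_std * local_rng.normal(size=(n_samples, z.shape[-1]))
    return decode(z_new, cond)


# Toy "real" sample of class 1, then five synthetic variants of the same class.
x_real = decode(rng.normal(size=DIM), COND_FEATURES[1])
x_synth = synthesize(x_real, label=1, n_samples=5)
```

Because the layer is exactly invertible, setting `noise_std=0` returns the original sample, which is a quick sanity check that only the local noise is being varied. In a downstream setting such as the augmentation experiment mentioned under Results, samples like `x_synth` would be added to the real training set with the label used for conditioning.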
Pages: 11792-11800
Page count: 9