A Comparative Study of Synthetic Dataset Generation Techniques

Cited by: 7
Authors
Dandekar, Ashish [1]
Zen, Remmy A. M. [1]
Bressan, Stephane [1]
Affiliations
[1] Natl Univ Singapore, Singapore, Singapore
Funding
National Research Foundation, Singapore;
Keywords
Synthetic datasets; Risk of disclosure; Privacy; Utility; IDENTIFICATION DISCLOSURE; RISKS;
DOI
10.1007/978-3-319-98812-2_35
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Unrestricted availability of datasets is important for researchers to evaluate their strategies for solving research problems. When publicly releasing datasets, it is equally important to protect the privacy of the respective data owners. Synthetic datasets that preserve utility while protecting the privacy of the data owners offer a middle ground. There are two ways to generate data synthetically. First, one can generate a fully synthetic dataset by subsampling it from a synthetically generated population; this technique is known as fully synthetic dataset generation. Second, one can generate a partially synthetic dataset by synthesizing the values of sensitive attributes; this technique is known as partially synthetic dataset generation. The datasets generated by these two techniques vary in their utility as well as in their risk of disclosure. We perform a comparative study of these techniques using different dataset synthesisers such as linear regression, decision tree, random forest and neural network. We evaluate the effectiveness of these techniques in terms of the utility that they preserve and the risk of disclosure that they incur. We find the decision tree to be an efficient and competitively effective dataset synthesiser.
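The contrast between the two techniques named in the abstract can be illustrated with a toy sketch. This is not the authors' procedure: it assumes a hypothetical dataset with one non-sensitive attribute (age) and one sensitive attribute (income), uses a simple least-squares line as the synthesiser, and, for the fully synthetic case, assumes the synthetic population's ages are resampled from the observed ones. All function and field names are illustrative.

```python
import random
import statistics


def fit_linear(xs, ys):
    """Least-squares fit y ~ a + b*x; returns (a, b, residual std)."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    a = my - b * mx
    residuals = [y - (a + b * x) for x, y in zip(xs, ys)]
    return a, b, statistics.pstdev(residuals)


def partially_synthetic(records, rng):
    """Keep each record's age; replace only the sensitive income by a model draw."""
    ages = [r["age"] for r in records]
    a, b, sd = fit_linear(ages, [r["income"] for r in records])
    return [{"age": x, "income": rng.gauss(a + b * x, sd)} for x in ages]


def fully_synthetic(records, population_size, sample_size, rng):
    """Build a synthetic population, then release a subsample of it."""
    ages = [r["age"] for r in records]
    a, b, sd = fit_linear(ages, [r["income"] for r in records])
    population = []
    for _ in range(population_size):
        x = rng.choice(ages)  # assumption: ages resampled from observed values
        population.append({"age": x, "income": rng.gauss(a + b * x, sd)})
    return rng.sample(population, sample_size)


rng = random.Random(0)
# Hypothetical original data: income grows roughly linearly with age.
original = [
    {"age": age, "income": 1000.0 + 50.0 * age + rng.gauss(0, 100)}
    for age in range(20, 60)
]
partial = partially_synthetic(original, rng)
full = fully_synthetic(original, population_size=1000, sample_size=40, rng=rng)
```

In the partially synthetic output every record still corresponds to an original individual (same age, new income), whereas a fully synthetic record has no such one-to-one link; this is one intuition for why the two techniques can differ in both utility and risk of disclosure.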
Pages: 387-395
Page count: 9
Related papers
50 in total
  • [1] Review and analysis of synthetic dataset generation methods and techniques for application in computer vision
    Goran Paulin
    Marina Ivasic‐Kos
    [J]. Artificial Intelligence Review, 2023, 56 : 9221 - 9265
  • [2] Review and analysis of synthetic dataset generation methods and techniques for application in computer vision
    Paulin, Goran
    Ivasic-Kos, Marina
    [J]. ARTIFICIAL INTELLIGENCE REVIEW, 2023, 56 (09) : 9221 - 9265
  • [3] Synthetic Dataset Generation of Driver Telematics
    So, Banghee
    Boucher, Jean-Philippe
    Valdez, Emiliano A.
    [J]. RISKS, 2021, 9 (04)
  • [4] Generation and study of the synthetic brain electron microscopy dataset for segmentation purpose
    Sokolov, N. A.
    Vasiliev, E. P.
    Getmanskaya, A. A.
    [J]. COMPUTER OPTICS, 2023, 47 (05) : 778 - 787
  • [5] Synthetic Dataset Generation for Fairer Unfairness Research
    Jiang, Lan
    Belitz, Clara
    Bosch, Nigel
    [J]. FOURTEENTH INTERNATIONAL CONFERENCE ON LEARNING ANALYTICS & KNOWLEDGE, LAK 2024, 2024, : 200 - 209
  • [6] A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset
    Cinthia M. Souza
    Magali R. G. Meireles
    Paulo E. M. Almeida
    [J]. Scientometrics, 2021, 126 : 135 - 156
  • [7] A comparative study of abstractive and extractive summarization techniques to label subgroups on patent dataset
    Souza, Cinthia M.
    Meireles, Magali R. G.
    Almeida, Paulo E. M.
    [J]. SCIENTOMETRICS, 2021, 126 (01) : 135 - 156
  • [8] MedWGAN based synthetic dataset generation for Uveitis pathology
    Sliman, Heithem
    Megdiche, Imen
    Alajramy, Loay
    Taweel, Adel
    Yangui, Sami
    Drira, Aida
    Lamine, Elyes
    [J]. INTELLIGENT SYSTEMS WITH APPLICATIONS, 2023, 18
  • [9] Synthetic time series dataset generation for unsupervised autoencoders
    Klopries, Hendrik
    Torres, David Orlando Salazar
    Schwung, Andreas
    [J]. 2022 IEEE 27TH INTERNATIONAL CONFERENCE ON EMERGING TECHNOLOGIES AND FACTORY AUTOMATION (ETFA), 2022,
  • [10] On the synthetic dataset generation for IPTV services based on user behavior
    Abdollahpouri, Alireza
    Qavami, Reyhan
    Moradi, Parham
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2018, 77 (07) : 8475 - 8493