Statistical Data Generation Using Sample Data

Cited by: 3
Authors
Fazekas, Balint [1]
Kiss, Attila [1, 2]
Affiliations
[1] Eotvos Lorand Univ, Fac Informat, Dept Informat Syst, Budapest, Hungary
[2] J Selye Univ, Komarno, Slovakia
Keywords
Clustering; Database; Data generation
DOI
10.1007/978-3-030-00063-9_4
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
As the volume of data stored in databases keeps growing, it is important to develop software that can generate large amounts of test data reflecting the properties of a given sample. With such data, database algorithms can be stress-tested and their performance evaluated. When the generated data set is much larger than the given sample, the process is called data augmentation or synthetic data generation. Data augmentation can also be very useful in Big Data benchmarking. This paper describes a method for statistical data generation based on a given sample, in which the generated result reflects the statistical properties of the sample as closely as possible. We explain how any given data can be represented numerically and hence clustered using the DBSCAN and K-means algorithms. We introduce a hybrid clustering method that combines the two algorithms and aims to unite their strengths. After the data is clustered, the individual sub-clusters are statistically analyzed, and pseudo-random data are generated from the results of that analysis. The results of the hybrid clustering algorithm show that artificial data reflecting the statistical properties of any given sample can be created.
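The pipeline outlined in the abstract (cluster the numeric sample, analyze each sub-cluster statistically, then draw pseudo-random data) can be illustrated with a minimal sketch. The abstract does not specify the hybrid DBSCAN/K-means algorithm, so the sketch below substitutes plain K-means as a stand-in, and it assumes each dimension of a cluster is sampled independently from a normal distribution with the cluster's mean and standard deviation. The function names `kmeans` and `generate` and all parameters are hypothetical and not taken from the paper.

```python
import random
import statistics


def kmeans(points, k, iterations=20, seed=0):
    """Plain Lloyd's k-means on tuples of floats (stand-in, not the paper's hybrid)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            # Assign each point to its nearest center (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])),
            )
            clusters[nearest].append(p)
        # Recompute centers; an empty cluster keeps its previous center.
        centers = [
            tuple(statistics.fmean(dim) for dim in zip(*c)) if c else centers[i]
            for i, c in enumerate(clusters)
        ]
    return [c for c in clusters if c]


def generate(sample, n, k=2, seed=0):
    """Draw n synthetic points that follow the per-cluster mean and stdev of sample."""
    rng = random.Random(seed)
    clusters = kmeans(sample, k, seed=seed)
    weights = [len(c) for c in clusters]
    # Per-cluster statistics: one (mean, population stdev) pair per dimension.
    stats = [
        [(statistics.fmean(d), statistics.pstdev(d)) for d in zip(*c)]
        for c in clusters
    ]
    out = []
    for _ in range(n):
        # Pick a cluster in proportion to its share of the sample.
        chosen = rng.choices(stats, weights=weights)[0]
        # Assumption: dimensions are independent and normally distributed.
        out.append(tuple(rng.gauss(mu, sigma) for mu, sigma in chosen))
    return out
```

Calling `generate(sample, 1000)` on a small sample of equal-length numeric tuples would return 1000 synthetic tuples. Sampling each dimension independently ignores correlations inside a cluster, so this is only a rough approximation of "reflecting the statistical properties of the sample".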
Pages: 29-36
Number of pages: 8