Statistical Data Generation Using Sample Data

Cited by: 3

Authors:

Fazekas, Balint [1]

Kiss, Attila [1,2]

Affiliations:

[1] Eotvos Lorand Univ, Fac Informat, Dept Informat Syst, Budapest, Hungary

[2] J Selye Univ, Komarno, Slovakia

Source:

NEW TRENDS IN DATABASES AND INFORMATION SYSTEMS, ADBIS 2018 | 2018 / Vol. 909

Keywords:

Clustering; Database; Data generation;

DOI:

10.1007/978-3-030-00063-9_4

Chinese Library Classification number:

TP [Automation technology, computer technology];

Discipline classification code:

0812;

Abstract:

Due to the ever-increasing amount of data stored in databases, it is important to develop software that can generate large volumes of test data reflecting the properties of a given sample. With such data, database algorithms can be stress-tested and evaluated on their performance. If far more data are generated than the given sample contains, the process is called data augmentation or synthetic data generation. Data augmentation can also be very useful in Big Data benchmarking tests. This paper describes a method for statistical data generation based on a given sample, where the generated result attempts to reflect the statistical properties of the sample as closely as possible. Throughout the paper we explain how any given data can be represented numerically, and hence clustered using the DBSCAN and K-means algorithms. We introduce a hybrid clustering method that combines both algorithms and aims to unify their strengths. After the data are clustered, the individual sub-clusters are statistically analyzed, and pseudo-random data are generated based on the analytical results. The results of the hybrid clustering algorithm show that artificial data can be created that reflect the statistical properties of any given sample.
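
The last step the abstract mentions, analyzing each sub-cluster statistically and generating pseudo-random data from the results, could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function name `generate_from_clusters` is hypothetical, the cluster labels are assumed to come from an earlier clustering step (for example DBSCAN or K-means) that is not shown, and modelling each column as an independent Gaussian is an assumption made here for simplicity.

```python
import random
import statistics
from collections import defaultdict


def generate_from_clusters(points, labels, n_samples, seed=0):
    """Draw n_samples synthetic points that mimic per-cluster statistics.

    points: list of equal-length numeric tuples (the sample).
    labels: one cluster label per point, assumed to come from a prior
            clustering step (hypothetical input, not computed here).
    """
    rng = random.Random(seed)

    # Group the sample by cluster label.
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)

    # Per cluster: size plus mean and standard deviation of each column.
    stats = {}
    for label, members in clusters.items():
        columns = list(zip(*members))
        means = [statistics.fmean(col) for col in columns]
        stdevs = [statistics.pstdev(col) for col in columns]
        stats[label] = (len(members), means, stdevs)

    # Pick clusters in proportion to their size, then sample each column
    # from an independent Gaussian (an assumed model, see the lead-in).
    keys = list(stats)
    weights = [stats[k][0] for k in keys]
    generated = []
    for label in rng.choices(keys, weights=weights, k=n_samples):
        _, means, stdevs = stats[label]
        point = tuple(rng.gauss(m, s) for m, s in zip(means, stdevs))
        generated.append((label, point))
    return generated


# Usage: a five-point sample with two clusters, augmented to 1000 points.
sample = [(1.0, 2.0), (1.2, 1.8), (0.9, 2.1), (10.0, 10.0), (10.5, 9.5)]
sample_labels = [0, 0, 0, 1, 1]
synthetic = generate_from_clusters(sample, sample_labels, n_samples=1000)
```

Sampling clusters in proportion to their size keeps the mixture weights of the sample, while the per-column statistics keep each cluster's location and spread. Correlations between columns are ignored in this sketch.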


Pages: 29-36

Page count: 8

