Development of a data generator for multivariate numerical data with arbitrary correlations and distributions

Cited by: 1
Authors
Vahldiek, Kai [1]
Zhou, Libing [1]
Zhu, Wenfeng [1]
Klawonn, Frank [1, 2]
Affiliations
[1] Ostfalia Univ Appl Sci, Dept Comp Sci, Salzdahlumer Str 46-48, D-38302 Wolfenbuttel, Germany
[2] Helmholtz Ctr Infect Res, Biostat, Braunschweig, Germany
Keywords
Data generator; data sets; correlations; distribution functions; simulations; CLASSIFICATION;
DOI
10.3233/IDA-205253
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Artificial or simulated data are particularly relevant in tests and benchmarks for machine learning methods, in teaching for exercises and for setting up analysis workflows. They are relevant when real data may not be used for reasons of data protection, or when special distributions or effects should be present in the data to test certain machine learning methods. In this paper a generator for multivariate numerical data with arbitrary marginal distributions and - as far as possible - arbitrary correlations is presented. The data generator is implemented in the open source statistics software R. It can also be used for categorical variables, if data are generated separately for the corresponding characteristics of a categorical variable. Additionally, outliers can be integrated. The use of the data generator is demonstrated with a concrete example.
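The abstract does not say how the generator combines marginal distributions with a correlation structure, so the following is only an illustrative sketch of one common approach (a Gaussian copula), not the authors' R implementation. The function names, the two example marginals, and the parameter values are all assumptions made for illustration. It draws correlated standard normal pairs, maps them to uniforms, and applies inverse CDFs to obtain the desired marginals.

```python
import math
import random
from statistics import NormalDist


def gaussian_copula_pairs(n, rho, inv_cdf_x, inv_cdf_y, seed=0):
    """Draw n pairs whose marginals are given by two inverse CDFs.

    Hypothetical helper, not taken from the paper: rho is the correlation
    of the latent normal variables, not of the returned values.
    """
    rng = random.Random(seed)
    std_normal = NormalDist()
    eps = 1e-12  # keep uniforms strictly inside (0, 1)
    pairs = []
    for _ in range(n):
        # Correlated standard normals via a 2x2 Cholesky factor.
        z1 = rng.gauss(0.0, 1.0)
        z2 = rho * z1 + math.sqrt(1.0 - rho * rho) * rng.gauss(0.0, 1.0)
        # Map to uniforms, then to the target marginals.
        u1 = min(max(std_normal.cdf(z1), eps), 1.0 - eps)
        u2 = min(max(std_normal.cdf(z2), eps), 1.0 - eps)
        pairs.append((inv_cdf_x(u1), inv_cdf_y(u2)))
    return pairs


def pearson(xs, ys):
    """Sample Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def exponential_inv_cdf(u):
    """Inverse CDF of an exponential distribution with rate 2 (assumed)."""
    return -math.log(1.0 - u) / 2.0


def uniform_inv_cdf(u):
    """Inverse CDF of a uniform distribution on [10, 20] (assumed)."""
    return 10.0 + 10.0 * u


if __name__ == "__main__":
    data = gaussian_copula_pairs(5000, 0.8, exponential_inv_cdf, uniform_inv_cdf)
    xs = [p[0] for p in data]
    ys = [p[1] for p in data]
    print("sample Pearson correlation:", round(pearson(xs, ys), 3))
```

In this sketch the Pearson correlation of the generated values generally differs from the latent `rho`, because the inverse CDFs are nonlinear, and some target correlations cannot be reached at all for a given pair of marginals. That is consistent with the abstract's qualifier that correlations are arbitrary only "as far as possible".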
Pages: 789-807
Number of pages: 19