Data splitting for artificial neural networks using SOM-based stratified sampling

Cited by: 167
Authors
May, R. J. [1 ]
Maier, H. R. [2 ]
Dandy, G. C. [2 ]
Affiliations
[1] United Water, Res & Dev, Adelaide, SA 5001, Australia
[2] Univ Adelaide, Sch Civil Environm & Mining Engn, Adelaide, SA 5005, Australia
Keywords
Artificial neural networks; Data splitting; Cross-validation; Self-organizing maps; Stratified sampling; VALIDATION; PREDICTION; SELECTION; VARIANCE; MODELS; BIAS
DOI
10.1016/j.neunet.2009.11.009
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Data splitting is an important consideration during artificial neural network (ANN) development where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between datasets. Of these approaches, DUPLEX is found to provide benchmark performance with good model performance, with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets. © 2009 Elsevier Ltd. All rights reserved.
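The Neyman allocation mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the strata are already given (in the paper they would come from SOM units), treats each record as a single number, and uses the textbook rule that the sample size for stratum h is proportional to N_h * s_h, the stratum size times its standard deviation. The function names `neyman_allocation` and `stratified_sample` are hypothetical.

```python
import random
import statistics


def neyman_allocation(strata, n_total):
    """Split n_total draws across strata in proportion to N_h * s_h.

    strata is a list of lists of numbers (a stand-in for SOM units).
    Because of rounding and capping, the counts may not sum exactly
    to n_total.
    """
    weights = []
    for values in strata:
        # Population standard deviation; a single-point stratum has none.
        spread = statistics.pstdev(values) if len(values) > 1 else 0.0
        weights.append(len(values) * spread)
    total = sum(weights)
    if total == 0:
        # All strata are constant: fall back to proportional allocation.
        weights = [len(values) for values in strata]
        total = sum(weights)
    # Never ask for more points than a stratum actually holds.
    return [
        min(len(values), round(n_total * w / total))
        for values, w in zip(strata, weights)
    ]


def stratified_sample(strata, n_total, seed=0):
    """Draw the allocated number of points from each stratum, without replacement."""
    rng = random.Random(seed)
    counts = neyman_allocation(strata, n_total)
    return [rng.sample(values, k) for values, k in zip(strata, counts)]


if __name__ == "__main__":
    low_spread = [0, 2] * 5    # 10 points, standard deviation 1
    high_spread = [0, 4] * 5   # 10 points, standard deviation 2
    print(neyman_allocation([low_spread, high_spread], 6))
```

With equal stratum sizes, the stratum with twice the spread receives twice as many draws, which is the behaviour that distinguishes Neyman allocation from proportional allocation.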
Pages: 283-294
Number of pages: 12
Related papers
50 in total
  • [41] Identification of polypeptides by using SOM neural networks
    Liu, Jianwei
    He, Ting
    Zhang, Bo
    Shen, Jingling
    INFRARED, MILLIMETER-WAVE, AND TERAHERTZ TECHNOLOGIES III, 2014, 9275
  • [42] Solar radiation forecasting based on meteorological data using artificial neural networks
    Ghanbarzadeh, A.
    Noghrehabadi, A. R.
    Assareh, E.
    Behrang, M. A.
    2009 7TH IEEE INTERNATIONAL CONFERENCE ON INDUSTRIAL INFORMATICS, VOLS 1 AND 2, 2009, : 227 - +
  • [43] Modeling Data Quality Using Artificial Neural Networks
    Laufer, Ralf
    Schwieger, Volker
    1ST INTERNATIONAL WORKSHOP ON THE QUALITY OF GEODETIC OBSERVATION AND MONITORING SYSTEMS (QUGOMS'11), 2015, 140 : 3 - 8
  • [44] Exploratory Data Analysis using Artificial Neural Networks
    Sriram, D.
    Kalaivani, K.
    Ulagapriya, K.
    Saritha, A.
    Sajeevram, A.
    PROCEEDINGS OF 2020 IEEE INTERNATIONAL CONFERENCE ON ADVANCES AND DEVELOPMENTS IN ELECTRICAL AND ELECTRONICS ENGINEERING (ICADEE), 2020, : 186 - 195
  • [45] Adaptive Sampling for WSAN Control Applications Using Artificial Neural Networks
    Nkwogu, Daniel N.
    Allen, Alastair R.
    JOURNAL OF SENSOR AND ACTUATOR NETWORKS, 2012, 1 (03): : 299 - 320
  • [46] Intelligent Data Analysis using artificial neural networks
    Hsu, SY
    Zhu, W
    RESEARCH QUARTERLY FOR EXERCISE AND SPORT, 2002, 73 (01) : A38 - A39
  • [47] The influence of sampling on landslide susceptibility mapping using artificial neural networks
    Gameiro, Samuel
    de Oliveira, Guilherme Garcia
    Guasselli, Laurindo Antonio
    GEOCARTO INTERNATIONAL, 2022
  • [48] Applying two-stage SOM-based clustering approaches to industrial data analysis
    Canetta, L
    Cheikhrouhou, N
    Glardon, R
    PRODUCTION PLANNING & CONTROL, 2005, 16 (08) : 774 - 784
  • [49] A SOM-based data mining strategy for adaptive modelling of an offset lithographic printing process
    Englund, C.
    Verikas, A.
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2007, 20 (03) : 391 - 400
  • [50] The effect of data sampling on the performance evaluation of artificial neural networks in medical diagnosis
    Tourassi, GD
    Floyd, CE
    MEDICAL DECISION MAKING, 1997, 17 (02) : 186 - 192