Data splitting for artificial neural networks using SOM-based stratified sampling

Cited by: 167
Authors
May, R. J. [1]
Maier, H. R. [2]
Dandy, G. C. [2]
Affiliations
[1] United Water, Res & Dev, Adelaide, SA 5001, Australia
[2] Univ Adelaide, Sch Civil Environm & Mining Engn, Adelaide, SA 5005, Australia
Keywords
Artificial neural networks; Data splitting; Cross-validation; Self-organizing maps; Stratified sampling; VALIDATION; PREDICTION; SELECTION; VARIANCE; MODELS; BIAS;
DOI
10.1016/j.neunet.2009.11.009
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data splitting is an important consideration during artificial neural network (ANN) development where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between data sets. Of these approaches, DUPLEX is found to provide benchmark performance with good model performance, with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets. (C) 2009 Elsevier Ltd. All rights reserved.
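The abstract names Neyman sampling over SOM strata without giving details. As a rough illustration only, the sketch below applies the textbook Neyman allocation rule (sample count per stratum proportional to stratum size times within-stratum standard deviation) to strata that are assumed to be given already. The function names, the use of a single variable's standard deviation, the rounding rule, and the fallback to proportional allocation are all assumptions for illustration, not the authors' procedure; the SOM clustering step itself is not shown.

```python
import random
from collections import defaultdict
from statistics import pstdev


def neyman_allocation(sizes, stds, n_total):
    """Textbook Neyman allocation: n_h proportional to N_h * S_h.

    sizes   -- number of records N_h in each stratum
    stds    -- within-stratum standard deviation S_h of one variable
    n_total -- desired overall sample size
    Counts are rounded and capped at the stratum size, so their sum
    may differ slightly from n_total.
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # Assumption: fall back to proportional allocation when every
        # stratum has zero spread.
        weights = list(sizes)
        total = sum(weights)
    raw = [n_total * w / total for w in weights]
    return [min(n, round(r)) for n, r in zip(sizes, raw)]


def stratified_sample(labels, values, n_total, seed=0):
    """Draw indices at random within each stratum using Neyman counts.

    labels -- stratum label per record (hypothetically, the index of
              the SOM unit each record maps to)
    values -- one numeric value per record, used to estimate S_h
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for idx, lab in enumerate(labels):
        strata[lab].append(idx)
    keys = sorted(strata)
    sizes = [len(strata[k]) for k in keys]
    stds = [pstdev([values[i] for i in strata[k]]) for k in keys]
    counts = neyman_allocation(sizes, stds, n_total)
    chosen = []
    for k, c in zip(keys, counts):
        chosen.extend(rng.sample(strata[k], c))
    return sorted(chosen)
```

With two equally sized strata whose standard deviations are 1.0 and 3.0, the rule assigns three times as many samples to the more variable stratum, which is the behaviour that distinguishes Neyman allocation from simple proportional allocation.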
Pages: 283-294
Page count: 12
Related papers
50 in total
  • [31] Sampling Frequency Influence at Fault Locations Using Algorithms Based on Artificial Neural Networks
    Silva, J. A. C. B.
    Silva, K. M.
    Neves, W. L. A.
    Souza, B. A.
    Costa, F. B.
    PROCEEDINGS OF THE 2012 FOURTH WORLD CONGRESS ON NATURE AND BIOLOGICALLY INSPIRED COMPUTING (NABIC), 2012, : 15 - 19
  • [32] DATA CLASSIFICATION BASED ON ARTIFICIAL NEURAL NETWORKS
    Gu, Xiao-Feng
    Liu, Lin
    Li, Jian-Ping
    Huang, Yuan-Yuan
    Lin, Jie
    2008 INTERNATIONAL CONFERENCE ON APPERCEIVING COMPUTING AND INTELLIGENCE ANALYSIS (ICACIA 2008), 2008, : 223 - 226
  • [33] Interactive SOM-based gene grouping: An approach to gene expression data analysis
    Gruzdz, A
    Ihnatowicz, A
    Slezak, D
    FOUNDATIONS OF INTELLIGENT SYSTEMS, PROCEEDINGS, 2005, 3488 : 514 - 523
  • [34] SOM-Based Techniques towards Hierarchical Visualisation of Network Forensics Traffic Data
    Palomo, E. J.
    Elizondo, D.
    Dominguez, E.
    Luque, R. M.
    Watson, Tim
    COMPUTATIONAL INTELLIGENCE FOR PRIVACY AND SECURITY, 2012, 394 : 75 - +
  • [35] Efficient video compression codebooks using SOM-based vector quantisation
    Ferguson, KL
    Allinson, NM
    IEE PROCEEDINGS-VISION IMAGE AND SIGNAL PROCESSING, 2004, 151 (02): : 102 - 108
  • [36] Humanoid Tactile Gesture Production using a Hierarchical SOM-based Encoding
    Pierris, Georgios
    Dahl, Torbjorn S.
    IEEE TRANSACTIONS ON AUTONOMOUS MENTAL DEVELOPMENT, 2014, 6 (02) : 153 - 167
  • [37] ShinySOM: graphical SOM-based analysis of single-cell cytometry data
    Kratochvil, Miroslav
    Bednarek, David
    Sieger, Tomas
    Fiser, Karel
    Vondrasek, Jiri
    BIOINFORMATICS, 2020, 36 (10) : 3288 - 3289
  • [38] SOM-based Visualization for Classifying Large-scale Sensing Data of Moonquakes
    Goto, Yasumichi
    Yamada, Ryuhei
    Yamamoto, Yukio
    Yokoyama, Shohei
    Ishikawa, Hiroshi
    2013 EIGHTH INTERNATIONAL CONFERENCE ON P2P, PARALLEL, GRID, CLOUD AND INTERNET COMPUTING (3PGCIC 2013), 2013, : 630 - 634
  • [39] A New SOM-based Active Contour Model using Conscience and Archiving Mechanisms
    Sadeghi, Fereshteh
    Izadinia, Hamid
    Safabakhsh, Reza
    11TH INTERNATIONAL CONFERENCE ON CONTROL, AUTOMATION, ROBOTICS AND VISION (ICARCV 2010), 2010, : 1219 - 1224
  • [40] Fast winner search for SOM-based monitoring and retrieval of high-dimensional data
    Kaski, S
    NINTH INTERNATIONAL CONFERENCE ON ARTIFICIAL NEURAL NETWORKS (ICANN99), VOLS 1 AND 2, 1999, (470): : 940 - 945