Data splitting for artificial neural networks using SOM-based stratified sampling

Cited by: 167
Authors
May, R. J. [1 ]
Maier, H. R. [2 ]
Dandy, G. C. [2 ]
Affiliations
[1] United Water, Res & Dev, Adelaide, SA 5001, Australia
[2] Univ Adelaide, Sch Civil Environm & Mining Engn, Adelaide, SA 5005, Australia
Keywords
Artificial neural networks; Data splitting; Cross-validation; Self-organizing maps; Stratified sampling; VALIDATION; PREDICTION; SELECTION; VARIANCE; MODELS; BIAS;
DOI
10.1016/j.neunet.2009.11.009
Chinese Library Classification
TP18 [Artificial intelligence theory];
Subject classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Data splitting is an important consideration during artificial neural network (ANN) development, where hold-out cross-validation is commonly employed to ensure generalization. Even for a moderate sample size, the sampling methodology used for data splitting can have a significant effect on the quality of the subsets used for training, testing and validating an ANN. Poor data splitting can result in inaccurate and highly variable model performance; however, the choice of sampling methodology is rarely given due consideration by ANN modellers. Increased confidence in the sampling is of paramount importance, since the hold-out sampling is generally performed only once during ANN development. This paper considers the variability in the quality of subsets that are obtained using different data splitting approaches. A novel approach to stratified sampling, based on Neyman sampling of the self-organizing map (SOM), is developed, with several guidelines identified for setting the SOM size and sample allocation in order to minimize the bias and variance in the datasets. Using an example ANN function approximation task, the SOM-based approach is evaluated in comparison to random sampling, DUPLEX, systematic stratified sampling, and trial-and-error sampling to minimize the statistical differences between data sets. Of these approaches, DUPLEX is found to provide benchmark performance, yielding good model performance with no variability. The results show that the SOM-based approach also reliably generates high-quality samples and can therefore be used with greater confidence than other approaches, especially in the case of non-uniform datasets, with the benefit of scalability to perform data splitting on large datasets. (C) 2009 Elsevier Ltd. All rights reserved.
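The Neyman sampling step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the SOM configuration or the variance measure used, so the sketch assumes the strata are already available as integer cluster labels (for example, the SOM unit each record maps to) and that a single target variable, here called `values`, drives the allocation. The function names `neyman_allocation` and `stratified_sample` are hypothetical. Neyman allocation assigns each stratum a share of the sample proportional to its size times its within-stratum standard deviation.

```python
import numpy as np


def neyman_allocation(labels, values, n_sample):
    """Return {stratum: sample size} with n_h proportional to N_h * S_h.

    Assumption: `labels` are precomputed stratum ids (e.g. SOM unit indices)
    and `values` is a single variable used to measure within-stratum spread.
    """
    labels = np.asarray(labels)
    values = np.asarray(values, dtype=float)
    strata = np.unique(labels)
    sizes = np.array([np.sum(labels == s) for s in strata])
    # Within-stratum standard deviation; single-member strata get 0.
    stds = np.array([
        values[labels == s].std(ddof=1) if np.sum(labels == s) > 1 else 0.0
        for s in strata
    ])
    weights = sizes * stds
    if weights.sum() == 0:
        # No spread anywhere: fall back to proportional allocation.
        weights = sizes.astype(float)
    raw = n_sample * weights / weights.sum()
    # Round down and never request more records than a stratum holds,
    # so the total may fall slightly short of n_sample.
    alloc = np.minimum(np.floor(raw).astype(int), sizes)
    return dict(zip(strata.tolist(), alloc.tolist()))


def stratified_sample(labels, values, n_sample, seed=0):
    """Draw record indices at random within each stratum, without replacement."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    alloc = neyman_allocation(labels, values, n_sample)
    chosen = []
    for stratum, k in alloc.items():
        members = np.flatnonzero(labels == stratum)
        chosen.extend(rng.choice(members, size=k, replace=False).tolist())
    return sorted(chosen)
```

In this sketch a stratum whose records are all identical receives no samples, while a widely spread stratum receives more, which is the behaviour that distinguishes Neyman allocation from proportional allocation. A practical data split would repeat the draw to build separate training, testing and validation subsets.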
Pages: 283-294
Number of pages: 12
Related papers
50 in total
  • [21] Universal Fault Detection for NFV using SOM-based Clustering
    Niwa, Tomonobu
    Miyazawa, Masanori
    Hayashi, Michiaki
    Stadler, Rolf
    2015 17TH ASIA-PACIFIC NETWORK OPERATIONS AND MANAGEMENT SYMPOSIUM APNOMS, 2015, : 315 - 320
  • [22] Accuracy Improvement of SOM-based Data Classification for Hematopoietic Tumor Patients
    Kamiura, Naotake
    Saitoh, Ayumu
    Isokawa, Teijiro
    Matsui, Nobuyuki
    2009 9TH INTERNATIONAL CONFERENCE ON INTELLIGENT SYSTEMS DESIGN AND APPLICATIONS, 2009, : 373 - 378
  • [23] Learning Robot Control Using a Hierarchical SOM-Based Encoding
    Pierris, Georgios
    Dahl, Torbjorn S.
    IEEE TRANSACTIONS ON COGNITIVE AND DEVELOPMENTAL SYSTEMS, 2017, 9 (01) : 30 - 43
  • [24] Privacy-preserving SOM-based recommendations on horizontally distributed data
    Kaleli, Cihan
    Polat, Huseyin
    KNOWLEDGE-BASED SYSTEMS, 2012, 33 : 124 - 135
  • [25] An Improved SOM-based Visualization Technique for DNA Microarray Data Analysis
    Patra, Jagdish C.
    Abraham, Jacob
    Meher, Pramod K.
    Chakraborty, Goutam
    2010 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS IJCNN 2010, 2010,
  • [26] Musical symbol recognition using SOM-based fuzzy systems
    Su, MC
    Tew, CY
    Chen, HH
    JOINT 9TH IFSA WORLD CONGRESS AND 20TH NAFIPS INTERNATIONAL CONFERENCE, PROCEEDINGS, VOLS. 1-5, 2001, : 2150 - 2153
  • [27] A Comparative Evaluation of SOM-based Anomaly Detection Methods for Multivariate Data
    Guo, Bingjun
    Song, Lei
    Zheng, Taisheng
    Liang, Haoran
    Wang, Hongfei
    2019 PROGNOSTICS AND SYSTEM HEALTH MANAGEMENT CONFERENCE (PHM-QINGDAO), 2019,
  • [28] Thoracic Surgery Patients Data Analysis Using SOM Neural Networks
    Orjuela-Canon, A. D.
    Gomez-Cajas, D. F.
    VI LATIN AMERICAN CONGRESS ON BIOMEDICAL ENGINEERING (CLAIB 2014), 2014, 49 : 761 - 764
  • [29] Clustering Analysis of Gene Data Based on PCA and SOM Neural Networks
    Zhao Anke
    Qiang Xinjian
    Cheng Guojian
    2014 Fifth International Conference on Intelligent Systems Design and Engineering Applications (ISDEA), 2014, : 284 - 287
  • [30] SOM-based recommendations with privacy on multi-party vertically distributed data
    Kaleli, C.
    Polat, H.
    JOURNAL OF THE OPERATIONAL RESEARCH SOCIETY, 2012, 63 (06) : 826 - 838