On the Importance of Data Balancing for Symbolic Regression

Cited by: 28
Authors
Vladislavleva, Ekaterina [1]
Smits, Guido [2]
den Hertog, Dick [3]
Affiliations
[1] Univ Antwerp, Dept Math & Comp Sci, B-2000 Antwerp, Belgium
[2] Dow Benelux BV, Core Res & Dev Dept, NL-4530 Terneuzen, Netherlands
[3] Tilburg Univ, Dept Econometr & Operat Res, Fac Econ & Business Adm, NL-5000 LE Tilburg, Netherlands
Keywords
Compression; data balancing; data scoring; data weighting; fitting; genetic programming; information content; modeling; subset selection; symbolic regression; outliers
DOI
10.1109/TEVC.2009.2029697
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification code
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Symbolic regression of input-output data conventionally treats all data records equally. We suggest a framework for automatically assigning weights to data samples according to each sample's relative importance. In this paper, we study the possibilities of improving symbolic regression on real-life data by incorporating weights into the fitness function. We introduce four weighting schemes defining the importance of a point relative to proximity, surrounding, remoteness, and nonlinear deviation from its k nearest-in-the-input-space neighbors. For enhanced analysis and modeling of large imbalanced data sets, we introduce a simple multidimensional iterative technique for subsampling. This technique allows a sensible partitioning (and compression) of data into nested subsets of arbitrary size, such that the subsets are balanced with respect to any of the presented weighting schemes. For cases where a given input-output data set contains some redundancy, we suggest an approach that considerably improves the effectiveness of regression by applying more modeling effort to a smaller subset of the data set with similar information content. This improvement is achieved through better exploration of the search space of potential solutions at the same number of function evaluations. We compare different approaches to regression on five benchmark problems with a fixed budget allocation. We demonstrate that a significant improvement in the quality of the regression models can be obtained with weighted regression, with exploratory regression using a compressed subset of similar information content, or with exploratory weighted regression on the compressed subset, weighted with one of the proposed weighting schemes.
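The idea of weighting samples by their k nearest input-space neighbors and using those weights in the fitness function can be illustrated with a minimal sketch. The abstract does not give the formulas for the four schemes, so the code below assumes one plausible proximity-style weight (mean Euclidean distance to the k nearest neighbors, normalized to sum to 1) and a weighted mean squared error as fitness. The names `proximity_weights` and `weighted_mse` are hypothetical and are not taken from the paper.

```python
import numpy as np


def proximity_weights(X, k=3):
    """Assumed proximity-style weight (not the paper's exact definition):
    mean Euclidean distance from each point to its k nearest neighbours
    in input space, normalized so that the weights sum to 1."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]  # treat a 1-D array as n samples of one input
    n = X.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    # Full pairwise distance matrix; fine for small illustrative data sets.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a point is not its own neighbour
    nearest = np.sort(dist, axis=1)[:, :k]
    w = nearest.mean(axis=1)
    total = w.sum()
    if total == 0.0:
        # All points coincide: fall back to uniform weights.
        return np.full(n, 1.0 / n)
    return w / total


def weighted_mse(y_true, y_pred, weights):
    """Weighted mean squared error, usable as a fitness value to minimize."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * (y_true - y_pred) ** 2) / np.sum(weights))


# Three clustered inputs and one isolated input.
X = np.array([0.0, 0.1, 0.2, 5.0])
w = proximity_weights(X, k=1)
y = np.array([1.0, 1.1, 1.2, 6.0])
y_hat = np.array([1.0, 1.0, 1.0, 5.0])
fitness = weighted_mse(y, y_hat, w)
```

Under this assumed scheme the isolated sample at 5.0 receives the largest weight, so a candidate model's error there dominates the fitness, while errors in the densely sampled region count for less.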
Pages: 252-277
Page count: 26