On the Importance of Data Balancing for Symbolic Regression

Cited by: 28
Authors
Vladislavleva, Ekaterina [1]
Smits, Guido [2]
den Hertog, Dick [3]
Affiliations
[1] Univ Antwerp, Dept Math & Comp Sci, B-2000 Antwerp, Belgium
[2] Dow Benelux BV, Core Res & Dev Dept, NL-4530 Terneuzen, Netherlands
[3] Tilburg Univ, Dept Econometr & Operat Res, Fac Econ & Business Adm, NL-5000 LE Tilburg, Netherlands
Keywords
Compression; data balancing; data scoring; data weighting; fitting; genetic programming; information content; modeling; subset selection; symbolic regression; outliers
DOI
10.1109/TEVC.2009.2029697
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
Symbolic regression of input-output data conventionally treats all data records equally. We suggest a framework for automatically assigning weights to data samples that takes into account each sample's relative importance. In this paper, we study the possibilities of improving symbolic regression on real-life data by incorporating weights into the fitness function. We introduce four weighting schemes defining the importance of a point relative to proximity, surrounding, remoteness, and nonlinear deviation from its k nearest-in-the-input-space neighbors. For enhanced analysis and modeling of large imbalanced data sets, we introduce a simple multidimensional iterative technique for subsampling. This technique allows a sensible partitioning (and compression) of data into nested subsets of arbitrary size in such a way that the subsets are balanced with respect to any of the presented weighting schemes. For cases where a given input-output data set contains some redundancy, we suggest an approach that considerably improves the effectiveness of regression by applying more modeling effort to a smaller subset of the data set that has a similar information content. This improvement is achieved through better exploration of the search space of potential solutions at the same number of function evaluations. We compare different approaches to regression on five benchmark problems with a fixed budget allocation. We demonstrate that significant improvement in the quality of the regression models can be obtained with weighted regression, with exploratory regression using a compressed subset with a similar information content, or with exploratory weighted regression on the compressed subset, which is weighted with one of the proposed weighting schemes.
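To make the idea of weighting samples inside the fitness function more concrete, here is a minimal Python sketch. It is not the authors' implementation: the abstract does not give the formulas, so the weight definition (mean Euclidean distance to the k nearest neighbors in input space), the normalization, and the function names `knn_proximity_weights` and `weighted_mse` are all assumptions made for illustration.

```python
import math


def knn_proximity_weights(X, k):
    """Weight each point by its mean Euclidean distance to its k nearest
    neighbors in input space (an assumed definition, for illustration only)."""
    n = len(X)
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < len(X)")
    weights = []
    for i, xi in enumerate(X):
        # Distances from point i to every other point, smallest first.
        dists = sorted(math.dist(xi, xj) for j, xj in enumerate(X) if j != i)
        weights.append(sum(dists[:k]) / k)
    total = sum(weights)
    # Normalize so the weights sum to 1; use uniform weights if all points coincide.
    return [w / total for w in weights] if total > 0 else [1.0 / n] * n


def weighted_mse(model, X, y, weights):
    """Weighted mean squared error of a candidate model (lower is better)."""
    num = sum(w * (model(x) - t) ** 2 for x, t, w in zip(X, y, weights))
    return num / sum(weights)


# Hypothetical 1-D data: three clustered inputs and one isolated input.
X = [(0.0,), (0.1,), (0.2,), (5.0,)]
y = [0.0, 0.1, 0.2, 5.0]
weights = knn_proximity_weights(X, k=2)

# Fitness of two candidate models under the weighted error.
exact_fit = weighted_mse(lambda x: x[0], X, y, weights)
constant_fit = weighted_mse(lambda x: 0.1, X, y, weights)
```

In this toy data the isolated input receives the largest weight, so a model that fits only the dense cluster (such as the constant above) is penalized more heavily than it would be under an unweighted error.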
Pages: 252-277
Page count: 26