Bioprocess data mining using regularized regression and random forests

Cited by: 18
Authors
Hassan, Syeda Sakira [1]
Farhan, Muhammad [1]
Mangayil, Rahul [2]
Huttunen, Heikki [1]
Aho, Tommi [2]
Affiliations
[1] Tampere Univ Technol, Dept Signal Proc, FIN-33101 Tampere, Finland
[2] Tampere Univ Technol, Dept Chem & Bioengn, FIN-33101 Tampere, Finland
Funding
Academy of Finland;
Keywords
FEATURE-SELECTION;
DOI
10.1186/1752-0509-7-S1-S5
Chinese Library Classification (CLC) number
Q [Biological sciences];
Discipline classification codes
07 ; 0710 ; 09 ;
Abstract
Background: In bioprocess development, data analysis needs to (1) provide an overview of existing data sets, (2) identify the primary control parameters, (3) determine a useful control direction, and (4) support the planning of future experiments. In particular, when multiple data sets are integrated, these needs cannot be properly addressed by regression models that assume a linear input-output relationship or a unimodal response function. Regularized regression and random forests, on the other hand, have several properties that may be important in this context. For example, they can handle a small number of samples relative to the number of variables, perform feature selection, and support the visualization of response surfaces so that prediction results can be presented in an illustrative way.
Results: In this work, the applicability of regularized regression (Lasso) and random forests (RF) to bioprocess data mining was examined, and their performance was benchmarked against multiple linear regression. As an example, we used data from a culture media optimization study for microbial hydrogen production. All three methods were capable of providing a significant model when the five variables of the culture media optimization were included linearly in the modeling. However, multiple linear regression failed when the products and squares of the variables were also included. In this case, modeling was still successful with Lasso (correlation between the observed and predicted yield was 0.69) and RF (0.91).
Conclusion: We found that both regularized regression and random forests were able to produce feasible models, and that the latter was efficient in capturing the non-linearity in the data. In this kind of data mining task on bioprocess data, both methods outperform multiple linear regression.
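The abstract describes expanding five media variables with their squares and products and then fitting a regularized (Lasso) model. The following is a minimal sketch of that idea, not the authors' implementation. It assumes a plain cyclic coordinate-descent solver written with the Python standard library, and it uses synthetic data rather than the hydrogen production data set. The names `expand_features`, `soft_threshold`, `standardize`, and `lasso_fit` are hypothetical and chosen for illustration only.

```python
import random


def expand_features(x):
    """Return the linear terms, squares, and pairwise products of the inputs."""
    n = len(x)
    squares = [v * v for v in x]
    products = [x[i] * x[j] for i in range(n) for j in range(i + 1, n)]
    return list(x) + squares + products


def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold (the Lasso proximal step)."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def standardize(X):
    """Center each column and scale it to unit variance."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    stds = []
    for j in range(p):
        var = sum((row[j] - means[j]) ** 2 for row in X) / n
        stds.append(var ** 0.5 or 1.0)  # constant columns are left as zeros
    return [[(row[j] - means[j]) / stds[j] for j in range(p)] for row in X]


def lasso_fit(X, y, alpha, n_iter=200):
    """Minimize (1/2n)*||y - Xw||^2 + alpha*||w||_1 by cyclic coordinate descent.

    Assumes the columns of X are already centered; y is centered here.
    Returns the coefficient list and the intercept (mean of y).
    """
    n, p = len(X), len(X[0])
    y_mean = sum(y) / n
    residual = [yi - y_mean for yi in y]  # residual for w = 0
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            col = [row[j] for row in X]
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue
            # Covariance of column j with the residual that excludes w[j].
            rho = sum(c * (r + c * w[j]) for c, r in zip(col, residual)) / n
            new_w = soft_threshold(rho, alpha) / z
            delta = new_w - w[j]
            if delta != 0.0:
                residual = [r - c * delta for c, r in zip(col, residual)]
                w[j] = new_w
    return w, y_mean


# Synthetic example (an assumption, not the study's data): 40 samples,
# 5 variables, response driven by x0 and the square of x1 plus noise.
rng = random.Random(0)
raw = [[rng.uniform(-1, 1) for _ in range(5)] for _ in range(40)]
y = [2.0 * x[0] - 3.0 * x[1] ** 2 + rng.gauss(0, 0.1) for x in raw]
X = standardize([expand_features(x) for x in raw])

w, intercept = lasso_fit(X, y, alpha=0.05)
selected = [j for j, wj in enumerate(w) if wj != 0.0]
print("selected feature indices:", selected)
```

With five inputs the expansion yields 20 candidate features (5 linear, 5 squared, 10 products), which is where an unregularized least-squares fit on few samples becomes unstable. The L1 penalty sets most coefficients exactly to zero, so the nonzero entries in `w` double as a feature selection.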
Pages: 7