Modelling species presence-only data with random forests

Cited by: 110
Authors
Valavi, Roozbeh [1]
Elith, Jane [1]
Lahoz-Monfort, Jose J. [1]
Guillera-Arroita, Gurutzeta [1]
Affiliations
[1] Univ Melbourne, Sch Biosci, Parkville, Vic, Australia
Keywords
class imbalance; class overlap; down-sampling; ecological niche model; presence-background; recursive partitioning; POINT PROCESS MODELS; STATISTICAL COMPARISONS; DECISION TREES; CLASSIFICATION; REGRESSION; BIAS; DISTRIBUTIONS; CLASSIFIERS; PERFORMANCE;
DOI
10.1111/ecog.05615
Chinese Library Classification (CLC) number
X176 [Biodiversity conservation]
Discipline classification code
090705
Abstract
The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence-only data together with 'background' samples. However, there is good evidence that RF with default parameters does not perform well for such 'presence-background' modelling. This is often attributed to the disparity between the number of presence and background samples, also known as 'class imbalance', and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence-only species data alongside background samples. We show that 'class overlap' (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence-absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence-background data, we compare performance of default RF with other weighting and sampling approaches. Our results demonstrate clear evidence of improvement in the performance of RFs when techniques that explicitly manage imbalance are used. We show that these either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence-background modelling allows new insights into how best to fit models, and should guide future efforts to best deal with such data.
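The down-sampling idea listed in the keywords can be illustrated with a minimal sketch. It assumes that down-sampling means drawing, for each tree, a bootstrap sample with equal numbers of presence and background points. The function name `balanced_bootstrap`, the label coding (1 for presence, 0 for background) and the sample sizes are hypothetical choices for illustration and are not taken from the paper. Tree fitting itself is omitted.

```python
import random
from collections import Counter


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree, balanced across the two classes.

    labels: sequence of 1 (presence) and 0 (background); assumed coding.
    rng: a random.Random instance, so results are reproducible.
    """
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(presence), len(background))  # size of the smaller class
    # Sample each class with replacement, as in an ordinary bootstrap,
    # but cap both classes at the size of the smaller one.
    return rng.choices(presence, k=n) + rng.choices(background, k=n)


# Hypothetical data: 50 presences against 10,000 background points.
labels = [1] * 50 + [0] * 10_000
rng = random.Random(42)

# One balanced sample per tree; a real forest would fit a tree to each.
samples = [balanced_bootstrap(labels, rng) for _ in range(5)]
class_counts = [Counter(labels[i] for i in s) for s in samples]
```

Each tree sees only as many background points as presences, but because every tree draws a different subset, the forest as a whole can still cover the full background sample. This is consistent with the abstract's aim of not compromising the environmental representativeness of the background.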
Pages: 1731-1742
Page count: 12