Modelling species presence-only data with random forests

Cited by: 110
Authors
Valavi, Roozbeh [1]
Elith, Jane [1]
Lahoz-Monfort, Jose J. [1]
Guillera-Arroita, Gurutzeta [1]
Affiliations
[1] Univ Melbourne, Sch Biosci, Parkville, Vic, Australia
Keywords
class imbalance; class overlap; down-sampling; ecological niche model; presence-background; recursive partitioning; point process models; statistical comparisons; decision trees; classification; regression; bias; distributions; classifiers; performance
DOI
10.1111/ecog.05615
Chinese Library Classification (CLC) number
X176 [Biodiversity conservation]
Discipline classification code
090705
Abstract
The random forest (RF) algorithm is an ensemble of classification or regression trees and is widely used, including for species distribution modelling (SDM). Many researchers use implementations of RF in the R programming language with default parameters to analyse species presence-only data together with 'background' samples. However, there is good evidence that RF with default parameters does not perform well for such 'presence-background' modelling. This is often attributed to the disparity between the number of presence and background samples, also known as 'class imbalance', and several solutions have been proposed. Here, we first set the context: the background sample should be large enough to represent all environments in the region. We then aim to understand the drivers of poor performance of RF when models are fitted to presence-only species data alongside background samples. We show that 'class overlap' (where both classes occur in the same environment) is an important driver of poor performance, alongside class imbalance. Class overlap can even degrade performance for presence-absence data. We explain, test and evaluate suggested solutions. Using simulated and real presence-background data, we compare performance of default RF with other weighting and sampling approaches. Our results demonstrate clear evidence of improvement in the performance of RFs when techniques that explicitly manage imbalance are used. We show that these either limit or enforce tree depth. Without compromising the environmental representativeness of the sampled background, we identify approaches to fitting RF that ameliorate the effects of imbalance and overlap and allow excellent predictive performance. Understanding the problems of RF in presence-background modelling allows new insights into how best to fit models, and should guide future efforts to best deal with such data.
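The down-sampling idea named in the keywords can be illustrated with a minimal sketch. It is not the authors' code, and the abstract does not specify their implementation. It assumes a common form of down-sampling in which each tree gets its own bootstrap sample with equal numbers of presence and background points, while the full background set stays available across the forest. The function name `balanced_bootstrap` and the toy class sizes (50 presences, 10,000 background points, 500 trees) are hypothetical.

```python
import random


def balanced_bootstrap(labels, rng):
    """Return row indices for one tree, with both classes drawn to the minority size.

    labels: list of 0/1 values (1 = presence, 0 = background).
    Each class is sampled with replacement, using as many draws per class
    as there are records in the smaller class.
    """
    presence = [i for i, y in enumerate(labels) if y == 1]
    background = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(presence), len(background))
    sample = [rng.choice(presence) for _ in range(n)]
    sample += [rng.choice(background) for _ in range(n)]
    return sample


# Hypothetical toy data: 50 presences against 10,000 background points.
labels = [1] * 50 + [0] * 10_000
rng = random.Random(42)

# One balanced sample per tree (500 trees is an assumed forest size).
per_tree = [balanced_bootstrap(labels, rng) for _ in range(500)]

first = per_tree[0]
n_presence = sum(labels[i] for i in first)
n_background = len(first) - n_presence

# Background rows used by at least one tree in the forest.
background_used = {i for sample in per_tree for i in sample if labels[i] == 0}
```

Each tree in this sketch sees a balanced sample, yet the forest as a whole still draws on most of the background. This is one way to read the abstract's point about managing imbalance without compromising the environmental representativeness of the sampled background. In R's randomForest package, comparable per-tree stratified sampling is usually configured through the `strata` and `sampsize` arguments.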
Pages: 1731-1742
Number of pages: 12