Handling missing values in trait data

Cited by: 77
Authors
Johnson, Thomas F. [1]
Isaac, Nick J. B. [2]
Paviolo, Agustin [3, 4]
Gonzalez-Suarez, Manuela [1]
Affiliations
[1] Univ Reading, Sch Biol Sci, Ecol & Evolutionary Biol, Reading RG6 6UR, Berks, England
[2] Ctr Ecol & Hydrol, Biodivers Sci Area, Wallingford, Oxon, England
[3] CONICET Univ Nacl Misiones, Inst Biol Subtrop, Misiones, Argentina
[4] Assoc Civil Ctr Invest Bosque Atlantico, Misiones, Argentina
Source
GLOBAL ECOLOGY AND BIOGEOGRAPHY | 2021 / Volume 30 / Issue 01
Funding
Natural Environment Research Council (UK);
Keywords
BHPMF; functional trait; imputation; life-history trait; MAR; MCAR; missing data; MNAR; multiple imputation chained equations; Rphylopars; PHYLOGENETIC IMPUTATION; PREDICTION; SUCCESS;
DOI
10.1111/geb.13185
Chinese Library Classification number
Q14 [Ecology (Bioecology)];
Discipline classification code
071012 ; 0713 ;
Abstract
Aim: Trait data are widely used in ecological and evolutionary phylogenetic comparative studies, but often values are not available for all species of interest. Traditionally, researchers have excluded species without data from analyses, but estimation of missing values using imputation has been proposed as a better approach. However, imputation methods have largely been designed for randomly missing data, whereas trait data are often not missing at random (e.g., more data for bigger species). Here, we evaluate the performance of approaches for handling missing values when considering biased datasets.
Location: Any.
Time period: Any.
Major taxa studied: Any.
Methods: We simulated continuous traits and separate response variables to test the performance of nine imputation methods and complete-case analysis (excluding missing values from the dataset) under biased missing data scenarios. We characterized performance by estimating the error in imputed trait values (deviation from the true value) and inferred trait-response relationships (deviation from the true relationship between a trait and response).
Results: Generally, Rphylopars imputation produced the most accurate estimate of missing values and best preserved the response-trait slope. However, estimates of missing data were still inaccurate, even with only 5% of values missing. Under severe biases, errors were high with every approach. Imputation was not always the best option, with complete-case analysis frequently outperforming Mice imputation and, to a lesser degree, BHPMF imputation. Mice, a popular approach, performed poorly when the response variable was excluded from the imputation model.
Main conclusions: Imputation can handle missing data effectively in some conditions but is not always the best solution. None of the methods we tested could deal effectively with severe biases, which can be common in trait datasets. We recommend rigorous data checking for biases before and after imputation and propose variables that can assist researchers working with incomplete datasets to detect data biases and minimize errors.
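The two performance measures named in the abstract (error in imputed trait values, and deviation of the inferred trait-response slope from the true one) can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions, not the authors' code: the function names, the exponential bias weights, the sample size, and the true slope of 2.0 are all hypothetical, and a simple mean fill stands in for an imputation method (it is not claimed to be one of the nine methods the study evaluated).

```python
import numpy as np


def simulate_biased_missingness(n=2000, true_slope=2.0, missing_frac=0.3, seed=0):
    """Simulate a trait, a response, and a biased missingness mask (all hypothetical)."""
    rng = np.random.default_rng(seed)
    trait = rng.normal(0.0, 1.0, n)
    response = true_slope * trait + rng.normal(0.0, 1.0, n)
    # Assumed bias: smaller trait values are more likely to be missing,
    # loosely mirroring "more data for bigger species".
    weights = np.exp(-trait)
    n_missing = int(round(missing_frac * n))
    missing_idx = rng.choice(
        n, size=n_missing, replace=False, p=weights / weights.sum()
    )
    missing = np.zeros(n, dtype=bool)
    missing[missing_idx] = True
    return trait, response, missing


def evaluate(trait, response, missing, true_slope=2.0):
    """Return imputation error for a mean fill and slope error for complete cases."""
    observed = ~missing
    # Stand-in imputation: fill every missing trait with the observed mean.
    fill_value = trait[observed].mean()
    imputation_rmse = float(np.sqrt(np.mean((trait[missing] - fill_value) ** 2)))
    # Complete-case analysis: fit the slope using only rows with an observed trait.
    cc_slope = float(np.polyfit(trait[observed], response[observed], 1)[0])
    return {
        "observed_mean_minus_true_mean": float(fill_value - trait.mean()),
        "imputation_rmse": imputation_rmse,
        "complete_case_slope_error": cc_slope - true_slope,
    }


if __name__ == "__main__":
    trait, response, missing = simulate_biased_missingness()
    for name, value in evaluate(trait, response, missing).items():
        print(f"{name}: {value:.3f}")
```

In this toy setup missingness depends only on the trait itself, so the observed mean is shifted upward and the mean fill misses the true values by a wide margin, while the complete-case slope should stay close to the true slope. That behaviour is a property of this particular sketch and should not be read as a result from the study.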
Pages: 51 - 62
Number of pages: 12