Handling missing values in trait data

Cited by: 77

Authors:

Johnson, Thomas F. [1]

Isaac, Nick J. B. [2]

Paviolo, Agustin [3,4]

Gonzalez-Suarez, Manuela [1]

Affiliations:

[1] Univ Reading, Sch Biol Sci, Ecol & Evolutionary Biol, Reading RG6 6UR, Berks, England

[2] Ctr Ecol & Hydrol, Biodivers Sci Area, Wallingford, Oxon, England

[3] CONICET Univ Nacl Misiones, Inst Biol Subtrop, Misiones, Argentina

[4] Assoc Civil Ctr Invest Bosque Atlantico, Misiones, Argentina

Source:

GLOBAL ECOLOGY AND BIOGEOGRAPHY | 2021 / Vol. 30 / Issue 01

Funding:

Natural Environment Research Council (UK);

Keywords:

BHPMF; functional trait; imputation; life-history trait; MAR; MCAR; missing data; MNAR; multiple imputation chained equations; Rphylopars; PHYLOGENETIC IMPUTATION; PREDICTION; SUCCESS;

DOI:

10.1111/geb.13185

Chinese Library Classification number:

Q14 [Ecology (bioecology)];

Discipline classification code:

071012 ; 0713 ;

Abstract:

Aim: Trait data are widely used in ecological and evolutionary phylogenetic comparative studies, but often values are not available for all species of interest. Traditionally, researchers have excluded species without data from analyses, but estimation of missing values using imputation has been proposed as a better approach. However, imputation methods have largely been designed for randomly missing data, whereas trait data are often not missing at random (e.g., more data for bigger species). Here, we evaluate the performance of approaches for handling missing values when considering biased datasets.

Location: Any.

Time period: Any.

Major taxa studied: Any.

Methods: We simulated continuous traits and separate response variables to test the performance of nine imputation methods and complete-case analysis (excluding missing values from the dataset) under biased missing data scenarios. We characterized performance by estimating the error in imputed trait values (deviation from the true value) and inferred trait-response relationships (deviation from the true relationship between a trait and response).

Results: Generally, Rphylopars imputation produced the most accurate estimate of missing values and best preserved the response-trait slope. However, estimates of missing data were still inaccurate, even with only 5% of values missing. Under severe biases, errors were high with every approach. Imputation was not always the best option, with complete-case analysis frequently outperforming Mice imputation and, to a lesser degree, BHPMF imputation. Mice, a popular approach, performed poorly when the response variable was excluded from the imputation model.

Main conclusions: Imputation can handle missing data effectively in some conditions but is not always the best solution. None of the methods we tested could deal effectively with severe biases, which can be common in trait datasets. We recommend rigorous data checking for biases before and after imputation and propose variables that can assist researchers working with incomplete datasets to detect data biases and minimize errors.
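
The abstract's central point, that imputation error grows when values are not missing at random, can be illustrated with a minimal sketch. This is not the authors' simulation code. It assumes a single hypothetical standardized trait, and it uses mean imputation purely as a simple stand-in, without claiming it is one of the nine methods the paper evaluated. The function names, sample size, missing fraction, and logistic bias weight are all assumptions chosen for illustration.

```python
import math
import random


def rmse(pred, truth):
    """Root-mean-square error between two equal-length sequences."""
    return math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, truth)) / len(truth))


def mean_imputation_error(trait, is_missing):
    """Impute missing entries with the observed mean; return (observed_mean, rmse)."""
    observed = [t for t, m in zip(trait, is_missing) if not m]
    hidden = [t for t, m in zip(trait, is_missing) if m]
    observed_mean = sum(observed) / len(observed)
    return observed_mean, rmse([observed_mean] * len(hidden), hidden)


def run_demo(n=5000, missing_fraction=0.3, seed=1):
    rng = random.Random(seed)
    # Hypothetical standardized trait (assumption: normal, mean 0, sd 1).
    trait = [rng.gauss(0.0, 1.0) for _ in range(n)]

    # Random missingness (MCAR): every value has the same chance of being missing.
    random_mask = [rng.random() < missing_fraction for _ in trait]

    # Biased missingness: smaller values are more likely to be missing
    # (cf. "more data for bigger species"). The logistic weight averages 0.5
    # for a symmetric trait, so the expected missing fraction is unchanged.
    biased_mask = [
        rng.random() < 2.0 * missing_fraction / (1.0 + math.exp(2.0 * t))
        for t in trait
    ]
    return {
        "random": mean_imputation_error(trait, random_mask),
        "biased": mean_imputation_error(trait, biased_mask),
    }


results = run_demo()
for scenario, (obs_mean, err) in results.items():
    print(f"{scenario}: observed mean = {obs_mean:.3f}, imputation RMSE = {err:.3f}")
```

Under these assumptions, the biased scenario should show an observed mean shifted toward larger values and a larger error in the imputed values than the random scenario. Comparing the distribution of observed values against what is expected is one simple way to check a dataset for bias before imputing.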

Pages: 51-62

Page count: 12
