How reliable is metabarcoding for pollen identification? An evaluation of different taxonomic assignment strategies by cross-validation

Cited by: 2
Authors
San Martin, Gilles [1]
Hautier, Louis [1]
Mingeot, Dominique [2]
Dubois, Benjamin [2]
Affiliations
[1] Walloon Agr Res Ctr, Life Sci Dept, Plant Hlth & Forest Unit, Gembloux, Belgium
[2] Walloon Agr Res Ctr, Life Sci Dept, Bioengn Unit, Gembloux, Belgium
Source
PEERJ | 2024 / Volume 12
Keywords
Metabarcoding; Taxonomic assignments; BLAST; Cross-validation; Accuracy; ITS2; rbcL; Pollen; Honey bee; LIGHT-MICROSCOPY
DOI
10.7717/peerj.16567
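Chinese Library Classification
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]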
Discipline classification codes
07; 0710; 09
Abstract
Metabarcoding is a powerful tool, increasingly used in many disciplines of environmental sciences. However, to assign a taxon to a DNA sequence, bioinformaticians need to choose between different strategies or parameter values, and these choices sometimes seem rather arbitrary. In this work, we present a case study on ITS2 and rbcL databases used to identify pollen collected by bees in Belgium. We blasted a random sample of sequences from the reference database against the remainder of the database using different strategies and compared the known taxonomy with the predicted one. This in silico cross-validation (CV) approach proved to be an easy yet powerful way to (1) assess the relative accuracy of taxonomic predictions, (2) define rules to discard dubious taxonomic assignments and (3) provide a more objective basis to choose the best strategy. We obtained the best results with the best BLAST hit (best bit score) rather than by selecting the majority taxon from the top 10 hits. The predictions were further improved by favouring the most frequent taxon among those with tied best bit scores. We obtained better results with databases containing the full sequences available on NCBI rather than restricting the sequences to the region amplified by the primers chosen in our study. Leaked CV showed that when the true sequence is present in the database, BLAST might still struggle to match the right taxon at the species level, particularly with rbcL. Classical 10-fold CV, where the true sequence is removed from the database, offers a different yet more realistic view of the true error rates. Taxonomic predictions with this approach worked well up to the genus level, particularly for ITS2 (5-7% of errors). Using a database containing only the local flora of Belgium did not improve the predictions up to the genus level for local species and made them worse for foreign species.
At the species level, using a database containing exclusively local species improved the predictions for local species by ~12%, but the error rate remained rather high: 25% for ITS2 and 42% for rbcL. Foreign species performed worse even when using a world database (59-79% of errors). We used classification trees and GLMs to model the % of errors vs. identity and consensus scores and determine appropriate thresholds below which the taxonomic assignment should be discarded. This resulted in a significant reduction in prediction errors, but at the cost of a much higher proportion of unassigned sequences. Despite this stringent filtering, at least 1 in 5 sequences deemed suitable for species-level identification ultimately proved to be misidentified. An examination of the variability in prediction accuracy between plant families showed that rbcL outperformed ITS2 for only two of the 27 families examined, and that the percentage of correct species-level assignments was much better for some families (e.g. 95% for Sapindaceae) than for others (e.g. 35% for Salicaceae).
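The two assignment rules compared in the abstract (best bit score with ties broken by the most frequent taxon, versus the majority taxon among the top 10 hits) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names, the `(taxon, bit_score)` input format and the toy hit list are all assumptions made for illustration, and the scores are invented.

```python
from collections import Counter

def best_hit_taxon(hits):
    """Pick the taxon with the best bit score; if several hits tie on that
    score, favour the taxon that occurs most often among the tied hits.

    `hits` is a hypothetical list of (taxon, bit_score) tuples for one query.
    """
    if not hits:
        return None
    top_score = max(score for _, score in hits)
    tied = [taxon for taxon, score in hits if score == top_score]
    # most_common() sorts by count; equally frequent taxa keep first-seen order
    return Counter(tied).most_common(1)[0][0]

def majority_top_n_taxon(hits, n=10):
    """Pick the most frequent taxon among the n highest-scoring hits."""
    if not hits:
        return None
    ranked = sorted(hits, key=lambda hit: hit[1], reverse=True)[:n]
    return Counter(taxon for taxon, _ in ranked).most_common(1)[0][0]

# Invented example data, for illustration only.
toy_hits = [
    ("Salix alba", 410.0),
    ("Salix caprea", 410.0),
    ("Salix caprea", 410.0),
    ("Salix fragilis", 395.0),
    ("Salix fragilis", 395.0),
    ("Salix fragilis", 395.0),
    ("Salix fragilis", 390.0),
]

print(best_hit_taxon(toy_hits))        # Salix caprea
print(majority_top_n_taxon(toy_hits))  # Salix fragilis
```

On this toy input the two rules disagree: the best-hit rule stays within the three hits tied at the top score, while the top-10 majority rule is swayed by more numerous lower-scoring hits.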
Pages: 26