Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data

Cited by: 143
Authors
Chung, Neo Christopher [1]
Miasojedow, Blazej [2]
Startek, Michal [1]
Gambin, Anna [1]
Affiliations
[1] Univ Warsaw, Fac Math Informat & Mech, Inst Informat, Stefana Banacha 2, PL-02097 Warsaw, Poland
[2] Polish Acad Sci, Inst Math, Jana & Jedrzeja Sniadeckich 8, PL-00656 Warsaw, Poland
Keywords
Jaccard; Tanimoto; Binary similarity; Presence-absence; Co-occurrences; P-value; SPECIES COOCCURRENCES; BETA-DIVERSITY; NONRANDOMNESS; COMMUNITIES; MODEL;
DOI
10.1186/s12859-019-3118-5
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Background: Surveys of the presence and absence of specific species across multiple biogeographic units (or bioregions) are used in a broad range of biological studies, from ecology to microbiology. Using binary presence-absence data, we evaluate species co-occurrences that help elucidate relationships among organisms and environments. To summarize similarity between occurrences of species, we routinely use the Jaccard/Tanimoto coefficient, which is the ratio of their intersection to their union. It is natural, then, to identify statistically significant Jaccard/Tanimoto coefficients, which suggest non-random co-occurrences of species. However, statistical hypothesis testing using this similarity coefficient has seldom been used or studied.
Results: We introduce a hypothesis test for similarity for biological presence-absence data, using the Jaccard/Tanimoto coefficient. Several key improvements are presented, including unbiased estimation of expectation and centered Jaccard/Tanimoto coefficients that account for occurrence probabilities. The exact and asymptotic solutions are derived. To overcome the computational burden due to high dimensionality, we propose bootstrap and measure concentration algorithms to efficiently estimate the statistical significance of binary similarity. Comprehensive simulation studies demonstrate that our proposed methods produce accurate p-values and false discovery rates. The proposed estimation methods are orders of magnitude faster than the exact solution, particularly with increasing dimensionality. We showcase their applications in evaluating co-occurrences of bird species on 28 islands of Vanuatu and fish species in 3347 freshwater habitats in France. The proposed methods are implemented in an open source R package called jaccard (https://cran.r-project.org/package=jaccard).
Conclusion: We introduce a suite of statistical methods for the Jaccard/Tanimoto similarity coefficient for binary data that enable straightforward incorporation of probabilistic measures in the analysis of species co-occurrences. Due to their generality, the proposed methods and implementations are applicable to a wide range of binary data arising from genomics, biochemistry, and other areas of science.
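The quantities named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not use the jaccard R package. The function names are hypothetical, the expectation under independence is assumed to be p_x p_y / (p_x + p_y - p_x p_y), and a plain permutation test stands in for the bootstrap and measure concentration algorithms, whose details the abstract does not give.

```python
import random


def jaccard(x, y):
    """Ratio of shared presences (intersection) to total presences (union)."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    return both / either if either else 0.0  # convention: 0.0 when both are empty


def centered_jaccard(x, y):
    """Jaccard coefficient minus its assumed expectation under independence."""
    px = sum(x) / len(x)  # occurrence probability of species x
    py = sum(y) / len(y)  # occurrence probability of species y
    denom = px + py - px * py
    expected = (px * py / denom) if denom else 0.0  # assumed formula, see lead-in
    return jaccard(x, y) - expected


def permutation_p_value(x, y, n_perm=2000, seed=0):
    """Two-sided p-value from shuffling y; a stand-in, not the paper's bootstrap."""
    rng = random.Random(seed)
    observed = abs(centered_jaccard(x, y))
    shuffled = list(y)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # breaks any association between x and y
        if abs(centered_jaccard(x, shuffled)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one correction keeps p above zero


# Hypothetical presence-absence vectors for two species over eight sites.
x = [1, 1, 1, 0, 0, 0, 1, 0]
y = [1, 1, 0, 0, 0, 0, 1, 0]
print(jaccard(x, y))  # 3 shared sites / 4 occupied sites = 0.75
print(centered_jaccard(x, y))
print(permutation_p_value(x, y))
```

Shuffling leaves the occurrence probabilities unchanged, so the centering term is constant across permutations; it is kept inside the loop only to keep the sketch short.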
Pages: 11
Related papers
50 in total
  • [1] Jaccard/Tanimoto similarity test and estimation methods for biological presence-absence data
    Neo Christopher Chung
    Błażej Miasojedow
    Michał Startek
    Anna Gambin
    BMC Bioinformatics, 20
  • [2] Jaccard/Tanimoto similarity test and estimation methods
    Chung, Neo Christopher
    Miasojedow, Błażej
    Startek, Michal
    Gambin, Anna
    arXiv, 2019
  • [3] A reliability index for presence-absence data
    Schaalje, GB
    Beus, BD
    COMMUNICATIONS IN STATISTICS-THEORY AND METHODS, 1997, 26 (02) : 355 - 374
  • [4] PRESENCE-ONLY AND PRESENCE-ABSENCE DATA FOR COMPARING SPECIES DISTRIBUTION MODELING METHODS
    Elith, Jane
    Graham, Catherine
    Valavi, Roozbeh
    Abegg, Meinrad
    Bruce, Caroline
    Ford, Andrew
    Guisan, Antoine
    Hijmans, Robert J.
    Huettmann, Falk
    Lohmann, Lucia
    Loiselle, Bette
    Moritz, Craig
    Overton, Jake
    Peterson, A. Townsend
    Phillips, Steven
    Richardson, Karen
    Williams, Stephen
    Wiser, Susan K.
    Wohlgemuth, Thomas
    Zimmermann, Niklaus E.
    Ferrier, Simon
    BIODIVERSITY INFORMATICS, 2020, 15 (02) : 69 - 80
  • [5] PROGENY TESTING USING PRESENCE-ABSENCE DATA
    CURNOW, RN
    ADVANCES IN APPLIED PROBABILITY, 1981, 13 (01) : 1 - 2
  • [6] EMPTY SITES AND THE ANALYSIS OF PRESENCE-ABSENCE DATA
    WRIGHT, SJ
    BIEHL, CC
    AMERICAN NATURALIST, 1983, 122 (06) : 833 - 834
  • [7] Measuring beta diversity for presence-absence data
    Koleff, P
    Gaston, KJ
    Lennon, JJ
    JOURNAL OF ANIMAL ECOLOGY, 2003, 72 (03) : 367 - 382
  • [8] Improved abundance prediction from presence-absence data
    Conlisk, Erin
    Conlisk, John
    Enquist, Brian
    Thompson, Jill
    Harte, John
    GLOBAL ECOLOGY AND BIOGEOGRAPHY, 2009, 18 (01) : 1 - 10
  • [9] Exploring multiple presence-absence data structures in ecology
    Podani, Janos
    Odor, Peter
    Fattorini, Simone
    Strona, Giovanni
    Heino, Jani
    Schmera, Denes
    ECOLOGICAL MODELLING, 2018, 383 : 41 - 51
  • [10] VISUALIZING SPECIES RICHNESS AND SITE SIMILARITY FROM PRESENCE-ABSENCE MATRICES
    Soberon, Jorge
    Cobos, Marlon E.
    Nunez-Penichet, Claudia
    BIODIVERSITY INFORMATICS, 2021, 16 (01) : 20 - 27