ploidyfrost: Reference-free estimation of ploidy level from whole genome sequencing data based on de Bruijn graphs

Cited by: 5

Authors:

Sun, Mingzhu

Pang, Erli

Bai, Wei-Ning

Zhang, Da-Yong

Lin, Kui [1, 2]

Affiliations:

[1] Beijing Normal Univ, State Key Lab Earth Surface Proc & Resource Ecol, Beijing, Peoples R China

[2] Beijing Normal Univ, Minist Educ, Key Lab Biodivers Sci & Ecol Engn, Coll Life Sci, Beijing, Peoples R China

Source:

MOLECULAR ECOLOGY RESOURCES | 2023 / Vol. 23 / Issue 02

Funding:

National Key Research and Development Program of China;

Keywords:

de Bruijn graph; ploidy estimation; polyploidy; whole genome sequencing; POLYPLOIDY; PLANTS; ACID;

DOI:

10.1111/1755-0998.13720

Chinese Library Classification (CLC) number:

Q5 [Biochemistry]; Q7 [Molecular Biology];

Discipline classification codes:

071010 ; 081704 ;

Abstract:

Polyploidy is ubiquitous and its consequences are complex and variable. A change of ploidy level generally influences genetic diversity and results in morphological, physiological and ecological differences between cells or organisms with different ploidy levels. To avoid cumbersome experiments and take advantage of the less biased information provided by the vast amounts of genome sequencing data, computational tools for ploidy estimation are urgently needed. Until now, although a few such tools have been developed, many aspects of this estimation, such as the requirement of a reference genome, the lack of informative results and objective inferences, and the influence of false positives from errors and repeats, need further improvement. We have developed ploidyfrost, a de Bruijn graph-based method, to estimate ploidy levels from whole genome sequencing data sets without a reference genome. ploidyfrost provides a visual representation of allele frequency distribution generated using the ggplot2 package as well as quantitative results using the Gaussian mixture model. In addition, it takes advantage of colouring information encoded in coloured de Bruijn graphs to analyse multiple samples simultaneously and to flexibly filter putative false positives. We evaluated the performance of ploidyfrost by analysing highly heterozygous or repetitive samples of Cyclocarya paliurus and a complex allooctoploid sample of Fragaria × ananassa. Moreover, we demonstrated that the accuracy of analysis results can be improved by constraining a threshold such as Cramér's V coefficient on variant features, which may significantly reduce the side effects of sequencing errors and annoying repeats on the graphical structure constructed.

Pages: 499-510

Page count: 12
