Detecting selection in low-coverage high-throughput sequencing data using principal component analysis

Cited by: 11
Authors
Meisner, Jonas [1]
Albrechtsen, Anders [1]
Hanghoj, Kristian [1]
Affiliations
[1] Univ Copenhagen, Bioinformat Ctr, Dept Biol, Copenhagen, Denmark
Keywords
CONVERGENT EVOLUTION; POPULATION;
DOI
10.1186/s12859-021-04375-2
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Background: Identifying signatures of selection between populations is often an important part of a population genetic study. With high-throughput DNA sequencing, larger sample sizes of populations with similar ancestries have become increasingly common. This has led to a need for methods capable of identifying signals of selection in populations with a continuous cline of genetic differentiation. Individuals from continuous populations are inherently challenging to group into meaningful units, which is why existing methods rely on principal component analysis to infer selection signals. These existing methods require called genotypes as input, which is problematic for studies based on low-coverage sequencing data.
Materials and methods: We have extended two selection statistics based on principal component analysis to genotype likelihood data and applied them to low-coverage sequencing data from the 1000 Genomes Project for populations with European and East Asian ancestry, to detect signals of selection in samples with continuous population structure.
Results: Here, we present two selection statistics that we have implemented in the PCAngsd framework. These methods account for genotype uncertainty, making it possible to conduct selection scans in continuous populations from low and/or variable coverage sequencing data. To illustrate their use, we applied the methods to low-coverage sequencing data from human populations of East Asian and European ancestries. We show that the implemented selection statistics can control the false positive rate and that they identify the same signatures of selection from low-coverage sequencing data as state-of-the-art software using high-quality called genotypes.
Conclusion: We show that selection scans of low-coverage sequencing data from populations with similar ancestry perform on par with scans of high-quality genotype data. Moreover, we demonstrate that PCAngsd outperforms selection statistics obtained from genotypes called from low-coverage sequencing data, without the need for ad-hoc filtering.
Pages: 13
Related papers
50 in total
  • [1] Detecting selection in low-coverage high-throughput sequencing data using principal component analysis
    Jonas Meisner
    Anders Albrechtsen
    Kristian Hanghøj
    [J]. BMC Bioinformatics, 22
  • [2] Heap: a highly sensitive and accurate SNP detection tool for low-coverage high-throughput sequencing data
    Kobayashi, Masaaki
    Ohyanagi, Hajime
    Takanashi, Hideki
    Asano, Satomi
    Kudo, Toru
    Kajiya-Kanegae, Hiromi
    Nagano, Atsushi J.
    Tainaka, Hitoshi
    Tokunaga, Tsuyoshi
    Sazuka, Takashi
    Iwata, Hiroyoshi
    Tsutsumi, Nobuhiro
    Yano, Kentaro
    [J]. DNA RESEARCH, 2017, 24 (04) : 397 - 405
  • [3] Fast and accurate estimation of multidimensional site frequency spectra from low-coverage high-throughput sequencing data
    Mas-Sandoval, Alex
    Pope, Nathaniel S.
    Nielsen, Knud Nor
    Altinkaya, Isin
    Fumagalli, Matteo
    Korneliussen, Thorfinn Sand
    [J]. GIGASCIENCE, 2022, 11
  • [6] Linkage Disequilibrium Estimation in Low Coverage High-Throughput Sequencing Data
    Bilton, Timothy P.
    McEwan, John C.
    Clarke, Shannon M.
    Brauning, Rudiger
    van Stijn, Tracey C.
    Rowe, Suzanne J.
    Dodds, Ken G.
    [J]. GENETICS, 2018, 209 (02) : 389 - 400
  • [7] Improved computations for relationship inference using low-coverage sequencing data
    Mostad, Petter
    Tillmar, Andreas
    Kling, Daniel
    [J]. BMC BIOINFORMATICS, 2023, 24 (01)
  • [8] Detecting Alu insertions from high-throughput sequencing data
    David, Matei
    Mustafa, Harun
    Brudno, Michael
    [J]. NUCLEIC ACIDS RESEARCH, 2013, 41 (17)
  • [9] Detecting Pathogenic Structural Variants with Low-Coverage PacBio Sequencing
    Hickey, L.
    Wenger, A. M.
    Baybayan, P.
    Peluso, P.
    Korlach, J.
    [J]. EUROPEAN JOURNAL OF HUMAN GENETICS, 2018, 26 : 729 - 729
  • [10] Improved computations for relationship inference using low-coverage sequencing data
    Petter Mostad
    Andreas Tillmar
    Daniel Kling
    [J]. BMC Bioinformatics, 24