On rare variants in principal component analysis of population stratification

Cited by: 17
Authors
Ma, Shengqing [1]
Shi, Gang [1]
Affiliations
[1] Xidian Univ, State Key Lab Integrated Serv Networks, 2 South Taibai Rd, Xian 710071, Shaanxi, Peoples R China
Keywords
Rare variant; Population stratification; Principal component analysis; Single nucleotide polymorphism; Association; Model; Inference
DOI
10.1186/s12863-020-0833-x
Chinese Library Classification number
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Background: Population stratification is a known confounder of genome-wide association studies because it can lead to false-positive results. Principal component analysis (PCA) is widely applied to analyze population structure with common variants. However, how well it performs when rare variants are used remains unclear.

Results: We derive the mathematical expectation of the genetic relationship matrix. The variance and covariance elements of the expected matrix depend explicitly on the allele frequencies of the genetic markers used in the PCA. We show that inter-population variance is contained solely in K principal components (PCs), and mostly in the largest K-1 PCs, where K is the number of populations in the sample. We propose two measures of population divergence: F_PC, the ratio of inter-population variance to intra-population variance in the K population-informative PCs, and d^2, the sum of squared distances among populations. We show analytically that as allele frequencies become small, F_PC decreases, d^2 decreases, and the portion of variance explained by the K PCs diminishes. The results are validated in an analysis of the 1000 Genomes Project data. With common variants (allele frequencies between 0.4 and 0.5), F_PC is 93.85, d^2 is 444.38, and the variance explained by the largest five PCs is 17.09%. With rare variants (frequencies between 0.0001 and 0.01), these values fall to 1.83, 17.83 and 0.74%, respectively.

Conclusions: PCA of population stratification performs worse with rare variants than with common ones. When analyzing population stratification with sequencing data, it is necessary to restrict the selection to common variants.
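The workflow the abstract describes (build a genetic relationship matrix, take its eigen-decomposition, then summarize how well the leading PCs separate populations) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the two-population Balding-Nichols simulation, the sample sizes, the F_ST value, and all function names are hypothetical, and the between/within ratio and centroid distance computed here are only loose stand-ins for the paper's F_PC and d^2, whose exact formulas are not given in this record. A toy simulation is not expected to reproduce the reported 1000 Genomes figures, and whether it shows the same trend depends on the assumed model.

```python
import numpy as np


def simulate_genotypes(rng, n_per_pop, n_snps, freq_lo, freq_hi, fst=0.05):
    """Simulate 0/1/2 genotypes for two populations.

    Assumption: Balding-Nichols model with a hypothetical F_ST of 0.05.
    """
    ancestral = rng.uniform(freq_lo, freq_hi, size=n_snps)
    a = ancestral * (1.0 - fst) / fst
    b = (1.0 - ancestral) * (1.0 - fst) / fst
    blocks = []
    for _ in range(2):
        pop_freq = rng.beta(a, b)  # population-specific allele frequencies
        blocks.append(rng.binomial(2, pop_freq, size=(n_per_pop, n_snps)))
    return np.vstack(blocks), np.repeat([0, 1], n_per_pop)


def pca_from_grm(geno):
    """Eigen-decompose the genetic relationship matrix of standardized genotypes."""
    geno = geno[:, geno.std(axis=0) > 0]  # drop monomorphic markers
    freq = geno.mean(axis=0) / 2.0
    z = (geno - 2.0 * freq) / np.sqrt(2.0 * freq * (1.0 - freq))
    grm = z @ z.T / z.shape[1]
    evals, evecs = np.linalg.eigh(grm)  # eigenvalues in ascending order
    order = np.argsort(evals)[::-1]
    evals = np.clip(evals[order], 0.0, None)  # remove tiny negative round-off
    return evals, evecs[:, order], z.shape[1]


def divergence_in_top_pcs(evals, evecs, labels, k):
    """Between/within variance ratio, summed squared centroid distance, and
    fraction of variance explained, all in the top k PCs (illustrative only)."""
    scores = evecs[:, :k] * np.sqrt(evals[:k])
    groups = [scores[labels == g] for g in np.unique(labels)]
    centroids = np.array([g.mean(axis=0) for g in groups])
    grand = scores.mean(axis=0)
    between = sum(len(g) * np.sum((c - grand) ** 2) for g, c in zip(groups, centroids))
    within = sum(np.sum((g - c) ** 2) for g, c in zip(groups, centroids))
    dist2 = sum(
        np.sum((centroids[i] - centroids[j]) ** 2)
        for i in range(len(centroids))
        for j in range(i + 1, len(centroids))
    )
    return between / within, dist2, evals[:k].sum() / evals.sum()


rng = np.random.default_rng(0)
frequency_bins = {"common": (0.4, 0.5), "rare": (0.0001, 0.01)}
results = {}
for name, (lo, hi) in frequency_bins.items():
    geno, labels = simulate_genotypes(rng, n_per_pop=100, n_snps=3000,
                                      freq_lo=lo, freq_hi=hi)
    evals, evecs, n_used = pca_from_grm(geno)
    ratio, dist2, explained = divergence_in_top_pcs(evals, evecs, labels, k=2)
    results[name] = {"ratio": ratio, "dist2": dist2,
                     "explained": explained, "n_snps_used": n_used}
    print(name, results[name])
```

Running the script prints the three summaries for each frequency bin, so the common and rare settings can be compared side by side. Real analyses would replace the simulation with genotype calls from sequencing data and use K equal to the number of populations in the sample.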
Pages: 11