SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

Cited by: 8
Authors
Zhang, Di [1]
Zhao, Linhai [1]
Li, Biao [1]
He, Zongxiao [1]
Wang, Gao T. [2]
Liu, Dajiang J. [3]
Leal, Suzanne M. [1]
Affiliations
[1] Baylor Coll Med, Dept Mol & Human Genet, Ctr Stat Genet, Houston, TX 77030 USA
[2] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA
[3] Penn State Univ, Coll Med, Dept Publ Hlth Sci, Hershey, PA 17033 USA
Keywords
GENERAL FRAMEWORK; GENETIC-VARIATION; WIDE ASSOCIATION; PARTICIPANTS; PROJECT; DISEASE; HEALTH
DOI
10.1016/j.ajhg.2017.05.017
Chinese Library Classification
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark, which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10^-6) was observed with CCDC62 (SKAT-O [p = 6.89 × 10^-7], combined multivariate collapsing [p = 1.48 × 10^-6], and burden of rare variants [p = 1.48 × 10^-6]). SEQSpark was also used to analyze 50,000 simulated exomes, and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster, and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.
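The exome-wide threshold quoted in the abstract (p < 2.5 × 10^-6) coincides with a Bonferroni correction of a 0.05 significance level over roughly 20,000 genes. The abstract does not state how the threshold was derived, so the gene count and significance level below are assumptions. The following is a minimal sketch, not part of SEQSpark, that checks the three reported CCDC62 p-values against that threshold:

```python
# Assumed inputs: the abstract gives only the resulting threshold,
# not the significance level or the number of genes tested.
ALPHA = 0.05       # assumed family-wise significance level
N_GENES = 20_000   # assumed number of gene-level tests

# Bonferroni-corrected per-test threshold.
threshold = ALPHA / N_GENES

# p-values for CCDC62 as reported in the abstract.
reported = {
    "SKAT-O": 6.89e-7,
    "combined multivariate collapsing": 1.48e-6,
    "burden of rare variants": 1.48e-6,
}

# Flag each method's result as significant or not at the corrected threshold.
significant = {method: p < threshold for method, p in reported.items()}

for method, p in reported.items():
    print(f"{method}: p = {p:.2e}, significant = {significant[method]}")
```

Under these assumptions, all three reported p-values fall below the corrected threshold.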
Pages: 115-122
Page count: 8
Related papers
50 in total
  • [41] Quantifying the mapping precision of genome-wide association studies using whole-genome sequencing data
    Yang Wu
    Zhili Zheng
    Peter M. Visscher
    Jian Yang
    Genome Biology, 18
  • [42] Quantifying the mapping precision of genome-wide association studies using whole-genome sequencing data
    Wu, Yang
    Zheng, Zhili
    Visscher, Peter M.
    Yang, Jian
    GENOME BIOLOGY, 2017, 18
  • [43] Large-Scale de novo Oligonucleotide Synthesis for Whole-Genome Synthesis and Data Storage: Challenges and Opportunities
    Song, Li-Fu
    Deng, Zheng-Hua
    Gong, Zi-Yi
    Li, Lu-Lu
    Li, Bing-Zhi
    FRONTIERS IN BIOENGINEERING AND BIOTECHNOLOGY, 2021, 9
  • [44] A rare variant non-parametric linkage method for nuclear and extended pedigrees with application to exome and whole genome sequence data
    Zhao, Linhai
    Zhang, Di
    Broadbent, Carl A.
    Wang, Gao T.
    Vardarajan, Badri N.
    Goate, Alison M.
    Mayeux, Richard
    Leal, Suzanne M.
    GENETIC EPIDEMIOLOGY, 2018, 42 (07) : 748 - 748
  • [45] Rare variants analysis using penalization methods for whole genome sequence data
    Akram Yazdani
    Azam Yazdani
    Eric Boerwinkle
    BMC Bioinformatics, 16
  • [46] Rare variants analysis using penalization methods for whole genome sequence data
    Yazdani, Akram
    Yazdani, Azam
    Boerwinkle, Eric
    BMC BIOINFORMATICS, 2015, 16
  • [47] Imputation to whole-genome sequence using multiple pig populations and its use in genome-wide association studies
    Sanne van den Berg
    Jérémie Vandenplas
    Fred A. van Eeuwijk
    Aniek C. Bouwman
    Marcos S. Lopes
    Roel F. Veerkamp
    Genetics Selection Evolution, 51
  • [48] Imputation to whole-genome sequence using multiple pig populations and its use in genome-wide association studies
    van den Berg, Sanne
    Vandenplas, Jeremie
    van Eeuwijk, Fred A.
    Bouwman, Aniek C.
    Lopes, Marcos S.
    Veerkamp, Roel F.
    GENETICS SELECTION EVOLUTION, 2019, 51 (1)
  • [49] Whole-genome amplification template combined with highly multiplexed SNP typing enables large-scale association studies from archived DNA samples
    Pask, R
    Rance, H
    Walker, N
    Lam, A
    Smink, L
    Smyth, D
    Barratt, BJ
    Todd, JA
    AMERICAN JOURNAL OF HUMAN GENETICS, 2003, 73 (05) : 438 - 438
  • [50] RARE VARIANT ANALYSIS FOR POST-TRAUMATIC STRESS DISORDER USING WHOLE-EXOME-SEQUENCING DATA
    Tan-Hoang Nguyen
    Coleman, Jonathan R. I.
    Gentry, Amanda Elswick
    Webb, Bradley T.
    Peterson, Roseann E.
    Kendler, Kenneth S.
    Riley, Brien P.
    Amstadter, Ananda B.
    Sheerin, Christina M.
    EUROPEAN NEUROPSYCHOPHARMACOLOGY, 2024, 87 : 207 - 208