SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

Cited by: 8
Authors
Zhang, Di [1]
Zhao, Linhai [1]
Li, Biao [1]
He, Zongxiao [1]
Wang, Gao T. [2]
Liu, Dajiang J. [3]
Leal, Suzanne M. [1]
Affiliations
[1] Baylor Coll Med, Dept Mol & Human Genet, Ctr Stat Genet, Houston, TX 77030 USA
[2] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA
[3] Penn State Univ, Coll Med, Dept Publ Hlth Sci, Hershey, PA 17033 USA
Keywords
GENERAL FRAMEWORK; GENETIC-VARIATION; WIDE ASSOCIATION; PARTICIPANTS; PROJECT; DISEASE; HEALTH
DOI
10.1016/j.ajhg.2017.05.017
Chinese Library Classification
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark, which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10⁻⁶) was observed with CCDC62 (SKAT-O [p = 6.89 × 10⁻⁷], combined multivariate collapsing [p = 1.48 × 10⁻⁶], and burden of rare variants [p = 1.48 × 10⁻⁶]). SEQSpark was also used to analyze 50,000 simulated exomes, and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster, and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.
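The exome-wide significance threshold quoted in the abstract matches the widely used convention of a Bonferroni correction at a family-wise error rate of 0.05 across roughly 20,000 gene-level tests. The abstract does not state how the threshold was derived, so the gene count and error rate below are assumptions, and the helper names are hypothetical. A minimal sketch that reproduces the threshold and checks the three reported CCDC62 p-values against it:

```python
# Assumptions (not stated in the abstract): family-wise alpha of 0.05
# and about 20,000 gene-level tests, the usual exome-wide convention.
ALPHA = 0.05
N_GENES = 20_000


def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


def significant(p_values, threshold):
    """Map each method name to whether its p-value is below the threshold."""
    return {method: p < threshold for method, p in p_values.items()}


THRESHOLD = bonferroni_threshold(ALPHA, N_GENES)

# CCDC62 p-values exactly as reported in the abstract.
CCDC62_P = {
    "SKAT-O": 6.89e-7,
    "combined multivariate collapsing": 1.48e-6,
    "burden of rare variants": 1.48e-6,
}

if __name__ == "__main__":
    print(f"threshold = {THRESHOLD:.2e}")
    for method, is_sig in significant(CCDC62_P, THRESHOLD).items():
        print(f"{method}: significant = {is_sig}")
```

Under these assumptions, all three aggregate tests fall below the threshold, which is consistent with the abstract describing the CCDC62 association as exome-wide significant.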
Pages: 115-122
Page count: 8