SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

Cited by: 8
Authors
Zhang, Di [1]
Zhao, Linhai [1]
Li, Biao [1]
He, Zongxiao [1]
Wang, Gao T. [2]
Liu, Dajiang J. [3]
Leal, Suzanne M. [1]
Affiliations
[1] Baylor Coll Med, Dept Mol & Human Genet, Ctr Stat Genet, Houston, TX 77030 USA
[2] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA
[3] Penn State Univ, Coll Med, Dept Publ Hlth Sci, Hershey, PA 17033 USA
Keywords
GENERAL FRAMEWORK; GENETIC-VARIATION; WIDE ASSOCIATION; PARTICIPANTS; PROJECT; DISEASE; HEALTH
DOI
10.1016/j.ajhg.2017.05.017
Chinese Library Classification (CLC) number
Q3 [Genetics]
Subject classification codes
071007; 090102
Abstract
Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10⁻⁶) was observed with CCDC62 (SKAT-O [p = 6.89 × 10⁻⁷], combined multivariate collapsing [p = 1.48 × 10⁻⁶], and burden of rare variants [p = 1.48 × 10⁻⁶]). SEQSpark was also used to analyze 50,000 simulated exomes and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.
Pages: 115-122
Number of pages: 8
Related papers
50 in total
  • [21] Variant Association Tools for Quality Control and Analysis of Large-Scale Sequence and Genotyping Array Data
    Wang, Gao T.
    Peng, Bo
    Leal, Suzanne M.
    AMERICAN JOURNAL OF HUMAN GENETICS, 2014, 94 (05) : 770 - 783
  • [22] Enrichment Analysis Informs Rare Variant Association Tests of Type 2 Diabetes and Glycemic Traits in CHARGE Whole-Genome Sequence
    Lent, Samantha
    Manning, Alisa
    Wessel, Jennifer
    Dupuis, Josee
    Meigs, James B.
    DIABETES, 2017, 66 : A478 - A478
  • [23] The contribution of whole-genome sequence data to genome-wide association studies in livestock: Outcomes and perspectives
    Ros-Freixedes, Roger
    LIVESTOCK SCIENCE, 2024, 281
  • [25] Group-based variant calling leveraging next-generation supercomputing for large-scale whole-genome sequencing studies
    Standish, Kristopher A.
    Carland, Tristan M.
    Lockwood, Glenn K.
    Pfeiffer, Wayne
    Tatineni, Mahidhar
    Huang, C. Chris
    Lamberth, Sarah
    Cherkas, Yauheniya
    Brodmerkel, Carrie
    Jaeger, Ed
    Smith, Lance
    Rajagopal, Gunaretnam
    Curran, Mark E.
    Schork, Nicholas J.
    BMC BIOINFORMATICS, 2015, 16
  • [26] Dynamic Scan Procedure for Detecting Rare-Variant Association Regions in Whole-Genome Sequencing Studies
    Li, Zilin
    Li, Xihao
    Liu, Yaowu
    Shen, Jincheng
    Chen, Han
    Zhou, Hufeng
    Morrison, Alanna C.
    Boerwinkle, Eric
    Lin, Xihong
    AMERICAN JOURNAL OF HUMAN GENETICS, 2019, 104 (05) : 802 - 814
  • [27] Will whole genome amplification prove reliable for large-scale association studies?
    Norton, N
    Ivanov, D
    Williams, NM
    Owen, MJ
    O'Donovan, MC
    AMERICAN JOURNAL OF MEDICAL GENETICS PART B-NEUROPSYCHIATRIC GENETICS, 2005, 138B (01) : 100 - 101
  • [28] Using pre-selected variants from large-scale whole-genome sequence data for single-step genomic predictions in pigs
    Jang, Sungbong
    Ros-Freixedes, Roger
    Hickey, John M.
    Chen, Ching-Yi
    Holl, Justin
    Herring, William O.
    Misztal, Ignacy
    Lourenco, Daniela
    GENETICS SELECTION EVOLUTION, 2023, 55 (01)
  • [30] MycoVarP: Mycobacterium Variant and Drug Resistance Prediction Pipeline for Whole-Genome Sequence Data Analysis
    Swargam, Sandeep
    Kumari, Indu
    Kumar, Amit
    Pradhan, Dibyabhaba
    Alam, Anwar
    Singh, Harpreet
    Jain, Anuja
    Devi, Kangjam Rekha
    Trivedi, Vishal
    Sarma, Jogesh
    Hanif, Mahmud
    Narain, Kanwar
    Ehtesham, Nasreen Zafar
    Hasnain, Seyed Ehtesham
    Ahmad, Shandar
    FRONTIERS IN BIOINFORMATICS, 2022, 1