SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

Cited by: 8

Authors:

Zhang, Di [1]

Zhao, Linhai [1]

Li, Biao [1]

He, Zongxiao [1]

Wang, Gao T. [2]

Liu, Dajiang J. [3]

Leal, Suzanne M. [1]

Affiliations:

[1] Baylor Coll Med, Dept Mol & Human Genet, Ctr Stat Genet, Houston, TX 77030 USA

[2] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA

[3] Penn State Univ, Coll Med, Dept Publ Hlth Sci, Hershey, PA 17033 USA

Source:

AMERICAN JOURNAL OF HUMAN GENETICS | 2017 / Volume 101 / Issue 01

Keywords:

GENERAL FRAMEWORK; GENETIC-VARIATION; WIDE ASSOCIATION; PARTICIPANTS; PROJECT; DISEASE; HEALTH;

DOI:

10.1016/j.ajhg.2017.05.017

Chinese Library Classification:

Q3 [Genetics];

Subject classification codes:

071007 ; 090102 ;

Abstract:

Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10^{-6}) was observed with CCDC62 (SKAT-O [p = 6.89 × 10^{-7}], combined multivariate collapsing [p = 1.48 × 10^{-6}], and burden of rare variants [p = 1.48 × 10^{-6}]). SEQSpark was also used to analyze 50,000 simulated exomes and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.

Pages: 115-122

Page count: 8
