SEQSpark: A Complete Analysis Tool for Large-Scale Rare Variant Association Studies Using Whole-Genome and Exome Sequence Data

Cited by: 8
Authors
Zhang, Di [1]
Zhao, Linhai [1]
Li, Biao [1]
He, Zongxiao [1]
Wang, Gao T. [2]
Liu, Dajiang J. [3]
Leal, Suzanne M. [1]
Affiliations
[1] Baylor Coll Med, Dept Mol & Human Genet, Ctr Stat Genet, Houston, TX 77030 USA
[2] Univ Chicago, Dept Human Genet, Chicago, IL 60637 USA
[3] Penn State Univ, Coll Med, Dept Publ Hlth Sci, Hershey, PA 17033 USA
Keywords
GENERAL FRAMEWORK; GENETIC-VARIATION; WIDE ASSOCIATION; PARTICIPANTS; PROJECT; DISEASE; HEALTH
DOI
10.1016/j.ajhg.2017.05.017
Chinese Library Classification
Q3 [Genetics]
Discipline classification codes
071007; 090102
Abstract
Massively parallel sequencing technologies provide great opportunities for discovering rare susceptibility variants involved in complex disease etiology via large-scale imputation and exome and whole-genome sequence-based association studies. Due to modest effect sizes, large sample sizes of tens to hundreds of thousands of individuals are required for adequately powered studies. Current analytical tools are obsolete when it comes to handling these large datasets. To facilitate the analysis of large-scale sequence-based studies, we developed SEQSpark, which implements parallel processing based on Spark to increase the speed and efficiency of performing data quality control, annotation, and association analysis. To demonstrate the versatility and speed of SEQSpark, we analyzed whole-genome sequence data from the UK10K, testing for associations with waist-to-hip ratios. The analysis, which was completed in 1.5 hr, included loading data, annotation, principal component analysis, and single variant and rare variant aggregate association analysis of >9 million variants. For rare variant aggregate analysis, an exome-wide significant association (p < 2.5 × 10⁻⁶) was observed with CCDC62 (SKAT-O [p = 6.89 × 10⁻⁷], combined multivariate collapsing [p = 1.48 × 10⁻⁶], and burden of rare variants [p = 1.48 × 10⁻⁶]). SEQSpark was also used to analyze 50,000 simulated exomes, and it required 1.75 hr for the analysis of a quantitative trait using several rare variant aggregate association methods. Additionally, the performance of SEQSpark was compared to Variant Association Tools and PLINK/SEQ. SEQSpark was always faster, and in some situations computation was reduced to a hundredth of the time. SEQSpark will empower large sequence-based epidemiological studies to quickly elucidate genetic variation involved in the etiology of complex traits.
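The exome-wide significance threshold quoted in the abstract matches the common convention of a Bonferroni correction across roughly 20,000 genes at a family-wise error rate of 0.05. The abstract does not say how the threshold was derived, so the gene count and error rate below are assumptions. A minimal sketch that checks the reported CCDC62 p-values against that threshold:

```python
# Assumptions (not stated in the abstract): about 20,000 genes are tested
# and the family-wise error rate is controlled at 0.05.
ALPHA = 0.05
N_GENES = 20_000


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# p-values for CCDC62 as reported in the abstract.
reported = {
    "SKAT-O": 6.89e-7,
    "combined multivariate collapsing": 1.48e-6,
    "burden of rare variants": 1.48e-6,
}

threshold = bonferroni_threshold(ALPHA, N_GENES)

# True for each method whose p-value falls below the threshold.
significant = {method: p < threshold for method, p in reported.items()}

for method, p in reported.items():
    verdict = "significant" if significant[method] else "not significant"
    print(f"{method}: p = {p:.2e} -> {verdict} at {threshold:.1e}")
```

Under these assumptions, all three aggregate tests fall below the threshold, which is consistent with the abstract describing the association as exome-wide significant.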
Pages: 115-122
Page count: 8