Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data

Cited by: 2
Authors
Sugolov, Anton [1]
Emmenegger, Eric [2]
Paterson, Andrew D. [3, 4]
Sun, Lei [4, 5]
Affiliations
[1] Univ Toronto, Fac Arts & Sci, Dept Math, Toronto, ON, Canada
[2] Univ Toronto, Dept Mech & Ind Engn, Toronto, ON, Canada
[3] Univ Toronto, Hosp Sick Children, Program Genet & Genome Biol, Toronto, ON, Canada
[4] Univ Toronto, Dalla Lana Sch Publ Hlth, Toronto, ON, Canada
[5] Univ Toronto, Fac Arts & Sci, Dalla Lana Sch Publ Hlth, Dept Stat Sci, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada; Canadian Institutes of Health Research
Keywords
1000 Genomes Project; Data Visualization; Genome-wide Association Study; Gene Expression; Hands-on Experience; Large-scale Data Analysis; Multiple Hypothesis Testing; Open Resource; Reproducible Research; UK Biobank; Science
DOI
10.1007/s12561-023-09375-9
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop in which senior high-school or junior undergraduate students, with some basic knowledge of data science but no prior knowledge of statistical genetics, conduct their own genome-wide association study (GWAS). The GWAS was performed on open-source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain approximately 1.4 million p-values from a real scientific study within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further motivate learning by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present among their friends or families. The appended open-source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
Pages: 250-264
Number of pages: 15
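The abstract mentions large-scale multiple hypothesis testing over roughly 1.4 million p-values. As a rough illustration only, and not the workshop's actual scripts, the sketch below shows two standard corrections: a Bonferroni per-test threshold and the Benjamini-Hochberg procedure. The function names, the significance levels, and the small p-value list are assumptions made for this example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance level that keeps the family-wise error rate at alpha."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def benjamini_hochberg(p_values, q=0.05):
    """Return booleans marking which tests are rejected at false discovery rate q."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every test whose rank is at or below the cutoff.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected


# Illustrative values only: 1.4 million tests as in the abstract, alpha assumed 0.05.
threshold = bonferroni_threshold(0.05, 1_400_000)

# A made-up list of ten p-values to show the Benjamini-Hochberg step-up rule.
example_p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
example_rejected = benjamini_hochberg(example_p, q=0.05)
```

With 1.4 million tests, the Bonferroni threshold is about 3.6e-8, which shows why a single-test cutoff of 0.05 would be far too lenient at this scale.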