Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data

Cited by: 2
Authors
Sugolov, Anton [1]
Emmenegger, Eric [2]
Paterson, Andrew D. [3, 4]
Sun, Lei [4, 5]
Affiliations
[1] Univ Toronto, Fac Arts & Sci, Dept Math, Toronto, ON, Canada
[2] Univ Toronto, Dept Mech & Ind Engn, Toronto, ON, Canada
[3] Univ Toronto, Hosp Sick Children, Program Genet & Genome Biol, Toronto, ON, Canada
[4] Univ Toronto, Dalla Lana Sch Publ Hlth, Toronto, ON, Canada
[5] Univ Toronto, Fac Arts & Sci, Dalla Lana Sch Publ Hlth, Dept Stat Sci, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada; Canadian Institutes of Health Research
Keywords
1000 Genomes Project; Data Visualization; Genome-wide Association Study; Gene Expression; Hands-on Experience; Large-scale Data Analysis; Multiple Hypothesis Testing; Open Resource; Reproducible Research; UK BIOBANK; SCIENCE
DOI
10.1007/s12561-023-09375-9
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop in which senior high-school or junior undergraduate students, with no prior knowledge of statistical genetics but some basic knowledge of data science, conduct their own genome-wide association study (GWAS). The GWAS was performed on open-source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain approximately 1.4 million p-values from a real scientific study within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, interpretation of results, and large-scale multiple hypothesis testing. To further motivate their learning by emphasizing the personal connection to this type of data analysis, students were encouraged to give short presentations about how GWAS has provided insights into the genetic basis of diseases present in their friends or families. The appended open-source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
Pages: 250-264
Number of pages: 15
Related papers
50 in total
  • [21] Genetic risk factors for periodontitis: a genome-wide association study using UK Biobank data
    Gao, Chenyi
    Iles, Mark M.
    Bishop, David Timothy
    Larvin, Harriet
    Bunce, David
    Wu, Bei
    Luo, Huabin
    Nibali, Luigi
    Pavitt, Susan
    Wu, Jianhua
    Kang, Jing
    CLINICAL ORAL INVESTIGATIONS, 2025, 29 (02)
  • [22] Finding regulatory modules through large-scale gene-expression data analysis
    Kloster, M
    Tang, C
    Wingreen, NS
    BIOINFORMATICS, 2005, 21 (07) : 1172 - 1179
  • [23] Large-scale genome-wide association study, using historical data, identifies conserved genetic architecture of cyanogenic glucoside content in cassava (Manihot esculenta Crantz) root
    Ogbonna, Alex C.
    Braatz de Andrade, Luciano Rogerio
    Rabbi, Ismail Y.
    Mueller, Lukas A.
    de Oliveira, Eder Jorge
    Bauchet, Guillaume J.
    PLANT JOURNAL, 2021, 105 (03) : 754 - 770
  • [24] Bagging Statistical Network Inference from Large-Scale Gene Expression Data
    Simoes, Ricardo de Matos
    Emmert-Streib, Frank
    PLOS ONE, 2012, 7 (03)
  • [25] A genome-wide linkage study of GAW15 gene expression data
    Donghui Kan
    Richard Cooper
    Xiaofeng Zhu
    BMC Proceedings, 1 (Suppl 1)
  • [26] Large-scale Exploration of Gene-Gene Interactions in Prostate Cancer Using a Multistage Genome-wide Association Study
    Ciampa, Julia
    Yeager, Meredith
    Amundadottir, Laufey
    Jacobs, Kevin
    Kraft, Peter
    Chung, Charles
    Wacholder, Sholom
    Yu, Kai
    Wheeler, William
    Thun, Michael J.
    Divers, W. Ryan
    Gapstur, Susan
    Albanes, Demetrius
    Virtamo, Jarmo
    Weinstein, Stephanie
    Giovannucci, Edward
    Willett, Walter C.
    Cancel-Tassin, Geraldine
    Cussenot, Olivier
    Valeri, Antoine
    Hunter, David
    Hoover, Robert
    Thomas, Gilles
    Chanock, Stephen
    Chatterjee, Nilanjan
    CANCER RESEARCH, 2011, 71 (09) : 3287 - 3295
  • [27] GPLEXUS: enabling genome-scale gene association network reconstruction and analysis for very large-scale expression data
    Li, Jun
    Wei, Hairong
    Liu, Tingsong
    Zhao, Patrick Xuechun
    NUCLEIC ACIDS RESEARCH, 2014, 42 (05)
  • [28] Defining transcription modules using large-scale gene expression data
    Ihmels, J
    Bergmann, S
    Barkai, N
    BIOINFORMATICS, 2004, 20 (13) : 1993 - 2003
  • [29] Efficient analysis of large-scale genome-wide data with two R packages: bigstatsr and bigsnpr
    Prive, Florian
    Aschard, Hugues
    Ziyatdinov, Andrey
    Blum, Michael G. B.
    BIOINFORMATICS, 2018, 34 (16) : 2781 - 2787
  • [30] Analysis of genome-wide association study data using the protein knowledge base
    Sara Ballouz
    Jason Y Liu
    Martin Oti
    Bruno Gaeta
    Diane Fatkin
    Melanie Bahlo
    Merridee A Wouters
    BMC Genetics, 12