Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data

Cited by: 2
Authors
Sugolov, Anton [1]
Emmenegger, Eric [2]
Paterson, Andrew D. [3, 4]
Sun, Lei [4, 5]
Affiliations
[1] Univ Toronto, Fac Arts & Sci, Dept Math, Toronto, ON, Canada
[2] Univ Toronto, Dept Mech & Ind Engn, Toronto, ON, Canada
[3] Univ Toronto, Hosp Sick Children, Program Genet & Genome Biol, Toronto, ON, Canada
[4] Univ Toronto, Dalla Lana Sch Publ Hlth, Toronto, ON, Canada
[5] Univ Toronto, Fac Arts & Sci, Dalla Lana Sch Publ Hlth, Dept Stat Sci, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada; Canadian Institutes of Health Research
Keywords
1000 Genomes Project; Data Visualization; Genome-wide Association Study; Gene Expression; Hands-on Experience; Large-scale Data Analysis; Multiple Hypothesis Testing; Open Resource; Reproducible Research; UK BIOBANK; SCIENCE;
DOI
10.1007/s12561-023-09375-9
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop for senior high-school or junior undergraduate students, without prior knowledge in statistical genetics but with some basic knowledge in data science, to conduct their own genome-wide association study (GWAS). The GWAS was performed for open-source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain approximately 1.4 million p-values from a real scientific study within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further their learning motivation by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open-source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
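The abstract names regression and large-scale multiple hypothesis testing as the core of the analysis. As a rough illustration only, and not the workshop's actual scripts, the sketch below regresses a simulated expression level on a simulated genotype dosage (0, 1, or 2 copies of an allele) for one SNP at a time, then compares each p-value with a Bonferroni threshold. The function name `snp_association`, the simulated data, the 0.05 family-wise error rate, and the normal approximation to the t statistic are all assumptions made for this example; the test count of 1.4 million is taken from the abstract's approximate figure.

```python
import math
import random
from statistics import NormalDist


def snp_association(genotypes, expression):
    """Simple linear regression of expression on genotype dosage.

    Returns (slope, two-sided p-value). The p-value uses a normal
    approximation to the t statistic, which is only reasonable for
    large samples (an assumption of this sketch).
    """
    n = len(genotypes)
    mean_g = sum(genotypes) / n
    mean_e = sum(expression) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    if sxx == 0:
        return 0.0, 1.0  # monomorphic SNP: no genotype variation to test
    sxy = sum((g - mean_g) * (e - mean_e)
              for g, e in zip(genotypes, expression))
    slope = sxy / sxx
    intercept = mean_e - slope * mean_g
    rss = sum((e - intercept - slope * g) ** 2
              for g, e in zip(genotypes, expression))
    se = math.sqrt(rss / (n - 2) / sxx)
    if se == 0:
        return slope, 0.0  # perfect fit, degenerate case
    z = slope / se
    # Lower-tail form avoids cancellation for very large |z|.
    return slope, 2 * NormalDist().cdf(-abs(z))


# Simulated data (hypothetical): 400 individuals, allele frequency 0.3.
random.seed(0)
n_samples = 400
causal = [sum(random.random() < 0.3 for _ in range(2)) for _ in range(n_samples)]
null = [sum(random.random() < 0.3 for _ in range(2)) for _ in range(n_samples)]
# Expression depends on the causal SNP only, plus unit-variance noise.
expr = [1.0 * g + random.gauss(0.0, 1.0) for g in causal]

# Bonferroni threshold for roughly 1.4 million tests at a 0.05 error rate.
n_tests = 1_400_000
threshold = 0.05 / n_tests

slope_causal, p_causal = snp_association(causal, expr)
slope_null, p_null = snp_association(null, expr)
print("causal SNP significant:", p_causal < threshold)
print("null SNP significant:", p_null < threshold)
```

Dividing the error rate by the number of tests is the simplest multiple-testing correction; it is conservative when nearby SNPs are correlated, which is one reason other thresholds and procedures are also used in practice.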
Pages: 250-264
Page count: 15
Related papers
50 items in total
  • [31] Analysis of genome-wide association study data using the protein knowledge base
    Ballouz, Sara
    Liu, Jason Y.
    Oti, Martin
    Gaeta, Bruno
    Fatkin, Diane
    Bahlo, Melanie
    Wouters, Merridee A.
    BMC GENETICS, 2011, 12
  • [32] Evaluation of genome-wide power of genetic association studies based on empirical data from the HapMap project
    Nannya, Yasuhito
    Taura, Kenjiro
    Kurokawa, Mineo
    Chiba, Shigeru
    Ogawa, Seishi
    HUMAN MOLECULAR GENETICS, 2007, 16 (20) : 2494 - 2505
  • [33] Data validation and statistical issues such as power and other considerations in genome-wide association study (GWAS)
    Tomita, Makoto
    WILEY INTERDISCIPLINARY REVIEWS-COMPUTATIONAL STATISTICS, 2023, 15 (03)
  • [34] Integrative pathway analysis of genome-wide association studies and gene expression data in prostate cancer
    Jia, Peilin
    Liu, Yang
    Zhao, Zhongming
    BMC SYSTEMS BIOLOGY, 2012, 6
  • [35] GENOME-WIDE ASSOCIATION STUDY OF EXTREME LONGEVITY USING WHOLE-GENOME SEQUENCING DATA
    Gurinovich, Anastasia
    Bae, Harold
    Song, Zeyuan
    Leshchyk, Anastasia
    Li, Mengze
    Andersen, Stacy
    Perls, Thomas
    Sebastiani, Paola
    INNOVATION IN AGING, 2022, 6 : 395 - 395
  • [36] Large-Scale Genome-Wide Association Meta-Analysis using Imputation from the Dense 1000 Genomes Project Map Identifies Novel Susceptibility Loci for Glycemic and Obesity Traits
    Horikoshi, Momoko
    Magi, Reedik
    Surakka, Ida
    Sarin, Antti-Pekka
    Mahajan, Anubha
    Marullo, Letizia
    Ferreira, Teresa
    Esko, Tonu
    Morris, Andrew P.
    Mccarthy, Mark I.
    Ripatti, Samuli
    Prokopenko, Inga
    DIABETES, 2013, 62 : A83 - A83
  • [37] TASUKE+: a web-based platform for exploring genome-wide association studies results and large-scale resequencing data
    Kumagai, Masahiko
    Nishikawa, Daiki
    Kawahara, Yoshihiro
    Wakimoto, Hironobu
    Itoh, Ryutaro
    Tabei, Norio
    Tanaka, Tsuyoshi
    Itoh, Takeshi
    DNA RESEARCH, 2019, 26 (06) : 445 - 452
  • [38] Mixed models for time-to-event outcomes with large-scale population cohorts and genome-wide data
    Benner, Christian
    Pirinen, Matti
    Salomaa, Veikko
    Palmgren, Juni
    Ripatti, Samuli
    GENETIC EPIDEMIOLOGY, 2015, 39 (07) : 533 - 533
  • [39] A modular approach for integrative analysis of large-scale gene-expression and drug-response data
    Kutalik, Zoltan
    Beckmann, Jacques S.
    Bergmann, Sven
    NATURE BIOTECHNOLOGY, 2008, 26 (05) : 531 - 539