Statistical Learning of Large-Scale Genetic Data: How to Run a Genome-Wide Association Study of Gene-Expression Data Using the 1000 Genomes Project Data

Cited by: 2
Authors
Sugolov, Anton [1]
Emmenegger, Eric [2]
Paterson, Andrew D. [3, 4]
Sun, Lei [4, 5]
Affiliations
[1] Univ Toronto, Fac Arts & Sci, Dept Math, Toronto, ON, Canada
[2] Univ Toronto, Dept Mech & Ind Engn, Toronto, ON, Canada
[3] Univ Toronto, Hosp Sick Children, Program Genet & Genome Biol, Toronto, ON, Canada
[4] Univ Toronto, Dalla Lana Sch Publ Hlth, Toronto, ON, Canada
[5] Univ Toronto, Fac Arts & Sci, Dalla Lana Sch Publ Hlth, Dept Stat Sci, Toronto, ON, Canada
Funding
Natural Sciences and Engineering Research Council of Canada; Canadian Institutes of Health Research
Keywords
1000 Genomes Project; Data Visualization; Genome-wide Association Study; Gene Expression; Hands-on Experience; Large-scale Data Analysis; Multiple Hypothesis Testing; Open Resource; Reproducible Research; UK BIOBANK; SCIENCE;
DOI
10.1007/s12561-023-09375-9
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Teaching statistics through engaging applications to contemporary large-scale datasets is essential to attracting students to the field. To this end, we developed a hands-on, week-long workshop in which senior high-school or junior undergraduate students, with some basic knowledge of data science but no prior knowledge of statistical genetics, conduct their own genome-wide association study (GWAS). The GWAS was performed on open-source gene expression data, using publicly available human genetics data. Assisted by a detailed instruction manual, students were able to obtain approximately 1.4 million p-values from a real scientific study within several days. This early motivation kept students engaged in learning the theories that support their results, including regression, data visualization, results interpretation, and large-scale multiple hypothesis testing. To further motivate learning by emphasizing the personal connection to this type of data analysis, students were encouraged to make short presentations about how GWAS has provided insights into the genetic basis of diseases that are present in their friends or families. The appended open-source, step-by-step instruction manual includes descriptions of the datasets used, the software needed, and results from the workshop. Additionally, scripts used in the workshop are archived on GitHub and Zenodo to further enhance reproducible research and training.
Pages: 250-264
Page count: 15
Source: Statistics in Biosciences, 2024, 16: 250-264