The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis

Cited by: 0
Authors
Ho, Chi-Hsuan [1]
Huang, Yu-Jyun [1]
Lai, Ying-Ju [1]
Mukherjee, Rajarshi [2]
Hsiao, Chuhsing Kate [1, 3]
Affiliations
[1] Natl Taiwan Univ, Inst Epidemiol & Prevent Med, Div Biostat & Data Sci, Taipei 10055, Taiwan
[2] Harvard Univ, Dept Biostat, Boston, MA 02494 USA
[3] Natl Taiwan Univ, Ctr Genom Med, Bioinformat & Biostat Core, Taipei 10055, Taiwan
Source
G3-GENES GENOMES GENETICS | 2021 / Vol. 12 / Issue 01
Keywords
association study; gene expression; gene set analysis; multivariate normality test; pathway analysis; ANTIFUNGAL SUSCEPTIBILITY PROFILES; CANDIDA-ORTHOPSILOSIS; CRYPTOCOCCUS-NEOFORMANS; GENOME SEQUENCE; METAPSILOSIS; PARAPSILOSIS; PREVALENCE; EMERGENCE; STRAINS; HYBRID;
DOI
Not available
Chinese Library Classification number
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
Gene-set analysis (GSA) is a standard procedure for exploring potential biological functions of a group of genes. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance, with an implicit assumption that the multivariate expression values are normally distributed. This assumption is commonly adopted in GSAs, particularly those in the group of functional class scoring (FCS) methods. The validity of the normality assumption, however, has been disputed in several studies, yet no systematic analysis has been carried out to assess the effect of this distributional assumption. Our goal in this study is not to propose a new GSA method but to first examine whether the multi-dimensional gene expression data in gene sets follow a multivariate normal (MVN) distribution. Six statistical methods in three categories of MVN tests were considered and applied to a total of 24 RNA data sets. These RNA values were collected from cancer patients as well as normal subjects, and the values were derived from microarray experiments, RNA sequencing, and single-cell RNA sequencing. Our first finding is that the MVN assumption is not always satisfied: it does not hold in many of the applications tested here. In the second part of this research, we evaluated the influence of non-normality on the statistical power of current FCS methods, both parametric and nonparametric. Specifically, the scenario of mixture distributions representing more than one population for the RNA values was considered. This second investigation demonstrates that non-normality of the RNA values causes a loss in the statistical power of these GSA tests, especially when subtypes exist. Among the FCS GSA tools examined here and among the scenarios studied in this research, the N-statistics outperform the others.
Based on the results from these two investigations, we conclude that the assumption of MVN should be used with caution when evaluating new GSA tools, since this assumption cannot be guaranteed and violation may lead to spurious results, loss of power, and incorrect comparison between methods. If a newly proposed GSA tool is to be evaluated, we recommend the incorporation of a wide range of multivariate non-normal distributions or sampling from large databases if available.
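To make the idea of testing gene-set expression values for multivariate normality concrete, the following is a minimal Python sketch of Mardia's skewness and kurtosis tests, applied to one simulated MVN sample and one simulated two-subtype mixture. It is an illustration only. The abstract does not name the six MVN tests used, so the choice of Mardia's test is an assumption, and the function name `mardia_test`, the sample size, the gene-set size, the 80/20 subtype split, and the mean shift of 3 are all hypothetical values chosen for the example.

```python
import numpy as np
from scipy import stats


def mardia_test(X):
    """Mardia's multivariate skewness and kurtosis tests.

    X is an (n, p) array: n samples (rows) by p genes in the set (columns).
    Returns the test statistics and their asymptotic p-values.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    centered = X - X.mean(axis=0)
    S = centered.T @ centered / n  # ML covariance estimate (divides by n)
    # D[i, j] = (x_i - mean)' S^{-1} (x_j - mean)
    D = centered @ np.linalg.solve(S, centered.T)

    b1 = (D ** 3).sum() / n ** 2       # multivariate skewness
    b2 = (np.diag(D) ** 2).mean()      # multivariate kurtosis

    skew_stat = n * b1 / 6.0
    skew_df = p * (p + 1) * (p + 2) / 6.0
    skew_p = stats.chi2.sf(skew_stat, skew_df)

    kurt_z = (b2 - p * (p + 2)) / np.sqrt(8.0 * p * (p + 2) / n)
    kurt_p = 2.0 * stats.norm.sf(abs(kurt_z))

    return {"skew_stat": skew_stat, "skew_p": skew_p,
            "kurt_z": kurt_z, "kurt_p": kurt_p}


# Hypothetical simulation settings (not taken from the paper).
rng = np.random.default_rng(0)
n, p = 500, 5

# Sample 1: a single MVN population.
X_mvn = rng.standard_normal((n, p))

# Sample 2: a mixture in which about 20% of subjects belong to a
# subtype whose mean expression is shifted by 3 in every gene.
subtype = rng.random(n) < 0.2
X_mix = rng.standard_normal((n, p)) + 3.0 * subtype[:, None]

result_mvn = mardia_test(X_mvn)
result_mix = mardia_test(X_mix)
print("MVN sample:    ", result_mvn)
print("Mixture sample:", result_mix)
```

In this sketch the mixture sample should show a much larger skewness statistic than the single-population sample, which is the kind of departure from MVN that the abstract associates with the presence of subtypes. The asymptotic p-values are only approximate for small samples or large gene sets.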
Pages: 12
Related papers
50 in total
  • [21] A study on alternatives to the permutation test in gene-set analysis
    Lee, Sunho
    KOREAN JOURNAL OF APPLIED STATISTICS, 2018, 31 (02) : 241 - 251
  • [22] Investigating the effect of paralogs on microarray gene-set analysis
    Andre J Faure
    Cathal Seoighe
    Nicola J Mulder
    BMC Bioinformatics, 12
  • [23] Pitfalls in the application of gene-set analysis to genetics studies
    Sedeno-Cortes, Adriana Estela
    Pavlidis, Paul
    TRENDS IN GENETICS, 2014, 30 (12) : 513 - 514
  • [24] De-correlating expression in gene-set analysis
    Nam, Dougu
    BIOINFORMATICS, 2010, 26 (18) : i511 - i516
  • [25] Investigating the effect of paralogs on microarray gene-set analysis
    Faure, Andre J.
    Seoighe, Cathal
    Mulder, Nicola J.
    BMC BIOINFORMATICS, 2011, 12
  • [26] Network enrichment analysis: extension of gene-set enrichment analysis to gene networks
    Andrey Alexeyenko
    Woojoo Lee
    Maria Pernemalm
    Justin Guegan
    Philippe Dessen
    Vladimir Lazar
    Janne Lehtiö
    Yudi Pawitan
    BMC Bioinformatics, 13
  • [27] Network enrichment analysis: extension of gene-set enrichment analysis to gene networks
    Alexeyenko, Andrey
    Lee, Woojoo
    Pernemalm, Maria
    Guegan, Justin
    Dessen, Philippe
    Lazar, Vladimir
    Lehtio, Janne
    Pawitan, Yudi
    BMC BIOINFORMATICS, 2012, 13
  • [28] Application of the parametric bootstrap for gene-set analysis of gene–environment interactions
    Brandon J. Coombes
    Joanna M. Biernacka
    European Journal of Human Genetics, 2018, 26 : 1679 - 1686
  • [29] GSAE: an autoencoder with embedded gene-set nodes for genomics functional characterization
    Chen, Hung-I Harry
    Chiu, Yu-Chiao
    Zhang, Tinghe
    Zhang, Songyao
    Huang, Yufei
    Chen, Yidong
    BMC SYSTEMS BIOLOGY, 2018, 12
  • [30] Gene-set meta-analysis of lung cancer identifies pathway related to systemic lupus erythematosus
    Rosenberger, Albert
    Sohns, Melanie
    Friedrichs, Stefanie
    Hung, Rayjean J.
    Fehringer, Gord
    McLaughlin, John
    Amos, Christopher I.
    Brennan, Paul
    Risch, Angela
    Brueske, Irene
    Caporaso, Neil E.
    Landi, Maria Teresa
    Christiani, David C.
    Wei, Yongyue
    Bickeboeller, Heike
    PLOS ONE, 2017, 12 (03):