The misuse of distributional assumptions in functional class scoring gene-set and pathway analysis

Cited by: 0
Authors
Ho, Chi-Hsuan [1]
Huang, Yu-Jyun [1]
Lai, Ying-Ju [1]
Mukherjee, Rajarshi [2]
Hsiao, Chuhsing Kate [1, 3]
Affiliations
[1] Natl Taiwan Univ, Inst Epidemiol & Prevent Med, Div Biostat & Data Sci, Taipei 10055, Taiwan
[2] Harvard Univ, Dept Biostat, Boston, MA 02494 USA
[3] Natl Taiwan Univ, Ctr Genom Med, Bioinformat & Biostat Core, Taipei 10055, Taiwan
Source
G3-GENES GENOMES GENETICS | 2021 / Vol. 12 / Issue 01
Keywords
association study; gene expression; gene set analysis; multivariate normality test; pathway analysis; ANTIFUNGAL SUSCEPTIBILITY PROFILES; CANDIDA-ORTHOPSILOSIS; CRYPTOCOCCUS-NEOFORMANS; GENOME SEQUENCE; METAPSILOSIS; PARAPSILOSIS; PREVALENCE; EMERGENCE; STRAINS; HYBRID;
DOI
Not available
Chinese Library Classification number
Q3 [Genetics];
Discipline classification codes
071007 ; 090102 ;
Abstract
Gene-set analysis (GSA) is a standard procedure for exploring the potential biological functions of a group of genes. The development of its methodology has been an active research topic in recent decades. Many GSA methods, when newly proposed, rely on simulation studies to evaluate their performance, with the implicit assumption that the multivariate expression values are normally distributed. This assumption is commonly adopted in GSA, particularly among functional class scoring (FCS) methods. The validity of the normality assumption has been disputed in several studies, yet no systematic analysis has been carried out to assess the effect of this distributional assumption. Our goal in this study is not to propose a new GSA method but to first examine whether the multi-dimensional gene expression data in gene sets follow a multivariate normal (MVN) distribution. Six statistical methods in three categories of MVN tests were considered and applied to a total of 24 RNA data sets. These RNA values were collected from cancer patients as well as normal subjects, and were derived from microarray experiments, RNA sequencing, and single-cell RNA sequencing. Our first finding is that the MVN assumption is not always satisfied: it does not hold in many of the applications tested here. In the second part of this research, we evaluated the influence of non-normality on the statistical power of current FCS methods, both parametric and nonparametric. Specifically, we considered the scenario of mixture distributions representing more than one population for the RNA values. This second investigation demonstrates that non-normality of the RNA values causes a loss in the statistical power of these GSA tests, especially when subtypes exist. Among the FCS GSA tools and scenarios examined in this research, the N-statistics outperform the others. Based on the results from these two investigations, we conclude that the MVN assumption should be used with caution when evaluating new GSA tools, since it cannot be guaranteed and its violation may lead to spurious results, loss of power, and incorrect comparisons between methods. When a newly proposed GSA tool is evaluated, we recommend incorporating a wide range of multivariate non-normal distributions or, if available, sampling from large databases.
Pages: 12