Selecting Classification Methods for Small Samples of Next-Generation Sequencing Data

Cited by: 0
Authors
Zhu, Jiadi [1]
Yuan, Ziyang [2]
Shu, Lianjie [3]
Liao, Wenhui [4]
Zhao, Mingtao [5]
Zhou, Yan [2]
Affiliations
[1] Xidian Univ, Dept Math & Stat, Xian, Peoples R China
[2] Shenzhen Univ, Inst Stat Sci, Coll Math & Stat, Shenzhen Key Lab Adv Machine Learning & Applicat, Shenzhen, Peoples R China
[3] Univ Macau, Fac Business Adm, Macau, Peoples R China
[4] GuangDong Univ Finance, Guangzhou, Peoples R China
[5] Anhui Univ Finance & Econ, Inst Stat & Appl Math, Bengbu, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
RNA-seq data; classification; PLDA; NBLDA; ZIPLDA; ZINBLDA
DOI
10.3389/fgene.2021.642227
Chinese Library Classification (CLC) number
Q3 [Genetics]
Discipline classification code
071007; 090102
Abstract
Next-generation sequencing has emerged as an essential technology for the quantitative analysis of gene expression. In medical research, RNA sequencing (RNA-seq) data are commonly used to identify which type of disease a patient has. Because of the discrete nature of RNA-seq data, the existing statistical methods that have been developed for microarray data cannot be directly applied to RNA-seq data. Existing statistical methods usually model RNA-seq data by a discrete distribution, such as the Poisson, the negative binomial, or the mixture distribution with a point mass at zero and a Poisson distribution to further allow for data with an excess of zeros. Consequently, analytic tools corresponding to the above three discrete distributions have been developed: Poisson linear discriminant analysis (PLDA), negative binomial linear discriminant analysis (NBLDA), and zero-inflated Poisson logistic discriminant analysis (ZIPLDA). However, it is unclear what the real distributions would be for these classifications when applied to a new and real dataset. Considering that count datasets are frequently characterized by excess zeros and overdispersion, this paper extends the existing distribution to a mixture distribution with a point mass at zero and a negative binomial distribution and proposes a zero-inflated negative binomial logistic discriminant analysis (ZINBLDA) for classification. More importantly, we compare the above four classification methods from the perspective of model parameters, as an understanding of parameters is necessary for selecting the optimal method for RNA-seq data. Furthermore, we determine that the above four methods could transform into each other in some cases. Using simulation studies, we compare and evaluate the performance of these classification methods in a wide range of settings, and we also present a decision tree model created to help us select the optimal classifier for a new RNA-seq dataset. 
The results on two real datasets agree with the theoretical and simulation results. The methods used in this work are implemented in open-source R scripts, with source code freely available at https://github.com/FocusPaka/ZINBLDA.
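To make the relationship among the four count models concrete, the following is a minimal sketch of a zero-inflated negative binomial probability mass function, written in Python for illustration rather than taken from the authors' R scripts. The parameterization is an assumption not stated in the abstract: mean mu, dispersion phi (so the variance of the negative binomial part is mu + phi * mu^2), and zero-inflation probability pi0. The function names nb_pmf and zinb_pmf are hypothetical.

```python
import math


def nb_pmf(y, mu, phi):
    """Negative binomial pmf with mean mu > 0 and dispersion phi >= 0.

    Assumed parameterization: variance = mu + phi * mu**2.
    phi == 0 is treated as the Poisson limit.
    """
    if phi <= 0:
        # Poisson(mu) pmf, computed on the log scale for stability
        return math.exp(y * math.log(mu) - mu - math.lgamma(y + 1))
    r = 1.0 / phi  # "size" parameter of the negative binomial
    log_p = (
        math.lgamma(y + r) - math.lgamma(r) - math.lgamma(y + 1)
        + r * math.log(r / (r + mu))
        + y * math.log(mu / (r + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y, mu, phi, pi0):
    """Mixture of a point mass at zero (weight pi0) and a negative binomial."""
    base = (1.0 - pi0) * nb_pmf(y, mu, phi)
    return pi0 + base if y == 0 else base


# Special cases under this parameterization:
#   pi0 = 0, phi > 0  -> negative binomial (the model behind NBLDA)
#   pi0 > 0, phi = 0  -> zero-inflated Poisson (ZIPLDA)
#   pi0 = 0, phi = 0  -> Poisson (PLDA)
p_zero_zinb = zinb_pmf(0, mu=2.0, phi=0.5, pi0=0.3)
p_zero_poisson = zinb_pmf(0, mu=2.0, phi=0.0, pi0=0.0)
```

Setting pi0 or phi to zero recovers the simpler models, which is one way to read the abstract's remark that the four methods can transform into each other in some cases.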
Pages: 11