Network Regularized Bi-Clustering for Cancer Subtype Categorization

Authors
Wang X. [1]
Wang J. [1]
Yu G.-X. [1]
Guo M.-Z. [2, 3]
Affiliations
[1] College of Computer and Information Science, Southwest University, Chongqing
[2] College of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing
[3] Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing
Funding
National Natural Science Foundation of China
Keywords
Bi-clustering; Cancer subtypes; Gene networks; Nonnegative matrix factorization; Sum-squared residue
DOI
10.11897/SP.J.1016.2019.01274
Abstract
Cancer subtype identification is crucial for understanding tumor heterogeneity. Existing methods have mainly applied traditional clustering algorithms (such as k-means and hierarchical clustering) to gene expression data to identify subtypes. These approaches, however, group the data along only one dimension at a time, either genes or samples, so they cannot discover patterns in which similar genes behave similarly over only a subset of conditions (or samples). Bi-clustering groups large-scale gene expression data along the sample and gene dimensions simultaneously and finds bi-clusters in which related samples exhibit similar expression profiles over a subset of genes, which in turn identifies the corresponding cancer subtypes. The discovered bi-clusters offer insights for categorizing cancer subtypes and for precise gene treatments. Incorporating gene-gene interaction networks can further improve the quality of the discovered bi-clusters. However, current efforts generally use the networks only to weight and select genes, so they are often disturbed by noisy interactions and misled by missing ones. There are also many types of bi-clusters, including constant, constant-row, constant-column, coherent-value additive, and coherent-value multiplicative bi-clusters. To address these limitations and explore multiple types of bi-clusters, this paper introduces a gene-gene interaction Network Regularized Bi-Clustering algorithm (NetRBC) based on Semi-Nonnegative Matrix Tri-Factorization (SNMTF). NetRBC first integrates the mean squared residue into SNMTF and optimizes the gene-cluster and sample-cluster indicator matrices by minimizing the sum-squared loss of the discovered bi-clusters. Next, it constructs a graph regularization term from the gene network and the gene-cluster indicator matrix.
The core idea of the regularization term is that if two genes interact with each other, they may co-regulate the development of one cancer subtype, so we expect them to be grouped into the same bi-cluster. NetRBC then incorporates the regularization term into the sum-squared-loss-based SNMTF to guide the collaborative factorization toward the gene-cluster and sample-cluster indicator matrices, and thus to improve the accuracy of cancer subtype categorization. A regularization parameter controls the contribution of the gene-gene interaction network. We also give an optimization procedure for the two indicator matrices that uses multiplicative updates to optimize one variable at a time, with the other variables fixed, until convergence. We conduct experiments on six cancer gene expression datasets with known subtypes to compare NetRBC with other methods. We further test NetRBC on two large-scale cancer gene expression datasets from The Cancer Genome Atlas (TCGA) project and use the clinical features of patients to evaluate performance, since the true subtypes of these samples are unknown. Extensive experimental results show that NetRBC groups patients into subtypes better than the competing methods, and that the proposed network regularization term significantly improves cancer subtype categorization accuracy. © 2019, Science Press. All rights reserved.
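The two quantities the abstract combines can be illustrated with a minimal sketch. The abstract does not give the exact NetRBC objective, so the code below assumes the standard forms: the Cheng-Church mean squared residue of a bi-cluster, and a graph Laplacian penalty tr(G^T L G) with L = D - W on a gene-cluster indicator matrix G. The function names and the toy matrices are hypothetical and are not taken from the paper.

```python
import numpy as np


def mean_squared_residue(block):
    """Cheng-Church mean squared residue of one bi-cluster (genes x samples).

    Each entry is compared with its row mean, column mean, and the overall
    mean; a perfectly additive bi-cluster has residue 0.
    """
    block = np.asarray(block, dtype=float)
    row_mean = block.mean(axis=1, keepdims=True)
    col_mean = block.mean(axis=0, keepdims=True)
    residue = block - row_mean - col_mean + block.mean()
    return float(np.mean(residue ** 2))


def graph_penalty(G, W):
    """Laplacian penalty tr(G^T L G) with L = D - W (assumed form).

    For a symmetric W this equals 0.5 * sum_ij W_ij * ||G_i - G_j||^2, so it
    grows when interacting genes receive different cluster indicators.
    """
    G = np.asarray(G, dtype=float)
    W = np.asarray(W, dtype=float)
    L = np.diag(W.sum(axis=1)) - W
    return float(np.trace(G.T @ L @ G))


# Toy additive bi-cluster: entry = gene effect + sample effect.
additive = np.add.outer([0.0, 1.0, 3.0], [2.0, 5.0, 6.0, 9.0])
# Same block with one perturbed entry, which breaks the additive pattern.
noisy = additive.copy()
noisy[0, 0] += 4.0

# Toy gene network: genes 0 and 1 interact, gene 2 is isolated.
W = np.array([[0, 1, 0],
              [1, 0, 0],
              [0, 0, 0]], dtype=float)
# Interacting genes placed in the same cluster vs. in different clusters.
G_same = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
G_split = np.array([[1, 0], [0, 1], [0, 1]], dtype=float)

if __name__ == "__main__":
    print("residue (additive):", mean_squared_residue(additive))
    print("residue (noisy):   ", mean_squared_residue(noisy))
    print("penalty (same):    ", graph_penalty(G_same, W))
    print("penalty (split):   ", graph_penalty(G_split, W))
```

Under these assumptions, a regularized objective would add the penalty, scaled by the regularization parameter, to the residue-based loss, so assignments that separate interacting genes cost more than assignments that keep them together.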
Pages: 1274-1288
Page count: 14