K-nearest-neighbors induced topological PCA for single cell RNA-sequence data analysis

Cited by: 3
Authors
Cottrell S. [1]
Hozumi Y. [1]
Wei G.-W. [1, 2, 3]
Affiliations
[1] Department of Mathematics, Michigan State University, East Lansing, MI 48824
[2] Department of Electrical and Computer Engineering, Michigan State University, East Lansing, MI 48824
[3] Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI 48824
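Funding
National Aeronautics and Space Administration (NASA); National Science Foundation (NSF); National Institutes of Health (NIH)
Keywords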
Clustering; Dimensionality reduction; Machine learning; Persistent homology; Persistent Laplacian; scRNA-seq; Topology;
DOI
10.1016/j.compbiomed.2024.108497
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell–cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is challenging because of its sparsity and the large number of genes involved. Dimensionality reduction and feature selection are therefore important for removing spurious signals and enhancing downstream analysis. Traditional PCA, a main workhorse of dimensionality reduction, cannot capture the geometrical structure embedded in the data, and previous graph Laplacian regularizations are limited to analysis at a single scale. We propose a topological Principal Components Analysis (tPCA) method that combines the persistent Laplacian (PL) technique with L2,1 norm regularization to address multiscale and multiclass heterogeneity in data. We further introduce a k-Nearest-Neighbor (kNN) persistent Laplacian technique to improve the robustness of our persistent Laplacian method. The proposed kNN-PL is a new algebraic topology technique that addresses many limitations of traditional persistent homology. Rather than inducing a filtration by varying a distance threshold, kNN-tPCA obtains its filtration by varying the number of neighbors in a kNN network at each step, and we find that this framework has significant implications for hyper-parameter tuning. We validate the efficacy of the proposed tPCA and kNN-tPCA methods on 11 diverse benchmark scRNA-seq datasets and show that they outperform other unsupervised PCA enhancements from the literature, as well as the popular Uniform Manifold Approximation and Projection (UMAP), t-Distributed Stochastic Neighbor Embedding (tSNE), and Projection Non-Negative Matrix Factorization (NMF), by significant margins. For example, tPCA provides improvements of up to 628%, 78%, and 149% over UMAP, tSNE, and NMF, respectively, in classification as measured by F1, and kNN-tPCA offers improvements of 53%, 63%, and 32% over UMAP, tSNE, and NMF, respectively, in clustering as measured by ARI. © 2024 Elsevier Ltd
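The kNN-induced filtration described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it only builds a nested family of symmetrized kNN graphs by increasing the neighbor count and forms their ordinary (unnormalized) graph Laplacians, which stand in here for the persistent Laplacians of the paper. The function names `knn_laplacians` and `accumulated_laplacian`, the weights, and the random data are all assumptions made for illustration.

```python
import numpy as np

def knn_laplacians(X, ks):
    """Return one unnormalized graph Laplacian per neighbor count in ks.

    Illustrative only: rows of X are samples (e.g. cells), and each k
    defines one step of a filtration of kNN graphs.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances between rows.
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    np.fill_diagonal(d2, np.inf)       # a point is never its own neighbor
    order = np.argsort(d2, axis=1)     # neighbors sorted nearest-first
    laplacians = []
    for k in ks:
        A = np.zeros((n, n))
        rows = np.repeat(np.arange(n), k)
        A[rows, order[:, :k].ravel()] = 1.0
        A = np.maximum(A, A.T)         # symmetrize: keep an edge if either end picks it
        laplacians.append(np.diag(A.sum(axis=1)) - A)
    return laplacians

def accumulated_laplacian(laplacians, weights):
    """Weighted sum of the per-step Laplacians (hypothetical weighting)."""
    return sum(w * L for w, L in zip(weights, laplacians))

# Toy data standing in for a reduced expression matrix (assumption).
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))
ks = [3, 6, 9]
Ls = knn_laplacians(X, ks)
L_acc = accumulated_laplacian(Ls, weights=[1.0, 0.5, 0.25])
```

Because the k nearest neighbors of a point are always among its k+1 nearest, the graphs are nested as k grows, which is what makes the sequence a filtration. A matrix such as `L_acc` could then serve as a multiscale regularizer in a graph-regularized PCA objective, but how the paper weights and combines the scales is not stated in this record.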
Related papers
50 in total
  • [21] The k-nearest neighbors method in single index regression model for functional quasi-associated time series data
    Salim Bouzebda
    Ali Laksaci
    Mustapha Mohammedi
    Revista Matemática Complutense, 2023, 36 : 361 - 391
  • [22] The k-nearest neighbors method in single index regression model for functional quasi-associated time series data
    Bouzebda, Salim
    Laksaci, Ali
    Mohammedi, Mustapha
    REVISTA MATEMATICA COMPLUTENSE, 2023, 36 (02): : 361 - 391
  • [23] Study of selected methods for balancing independent data sets in k-nearest neighbors classifiers with Pawlak conflict analysis
    Przybyla-Kasperek, Malgorzata
    APPLIED SOFT COMPUTING, 2022, 129
  • [24] Deep soft K-means clustering with self-training for single-cell RNA sequence data
    Chen, Liang
    Wang, Weinan
    Zhai, Yuyao
    Deng, Minghua
    NAR GENOMICS AND BIOINFORMATICS, 2020, 2 (02)
  • [25] Topological and geometric analysis of cell states in single-cell transcriptomic data
    Huynh, Tram
    Cang, Zixuan
    BRIEFINGS IN BIOINFORMATICS, 2024, 25 (03)
  • [26] Immunological Response to Single Pathogen Challenge with Agents of the Bovine Respiratory Disease Complex: An RNA-Sequence Analysis of the Bronchial Lymph Node Transcriptome
    Tizioto, Polyana C.
    Kim, JaeWoo
    Seabury, Christopher M.
    Schnabel, Robert D.
    Gershwin, Laurel J.
    Van Eenennaam, Alison L.
    Toaff-Rosenstein, Rachel
    Neibergs, Holly L.
    Taylor, Jeremy F.
    PLOS ONE, 2015, 10 (06):
  • [27] iEMNN: An Iterative Integration Method for Single-Cell Transcriptomic Data Based on Network Similarity Enhancement and Mutual Nearest Neighbors
    Lin, Xuesheng
    Jiang, Yusheng
    Guan, Jinting
    ADVANCED INTELLIGENT COMPUTING IN BIOINFORMATICS, PT II, ICIC 2024, 2024, 14882 : 201 - 211
  • [28] BIG DATA ANALYSIS IN A GEOINFORMATIC PROBLEM OF SHORT-TERM TRAFFIC FLOW FORECASTING BASED ON A K NEAREST NEIGHBORS METHOD
    Agafonov, A. A.
    Yumaganov, A. S.
    Myasnikov, V. V.
    COMPUTER OPTICS, 2018, 42 (06) : 1101 - 1111
  • [29] A Novel Approach to Single Cell RNA-Sequence Analysis Facilitates In Silico Gene Reporting of Human Pluripotent Stem Cell-Derived Retinal Cell Types (vol 36, pg 3, 2018)
    Ide, Kanako
    Mitsui, Kaoru
    Irie, Rie
    Matsushita, Yohei
    Ijichi, Nobuhiro
    Toyodome, Soichiro
    Kosai, Ken-Ichiro
    STEM CELLS, 2018, 36 (07) : 1133 - 1133
  • [30] A meta-analysis and review of the literature on the k-Nearest Neighbors technique for forestry applications that use remotely sensed data
    Chirici, Gherardo
    Mura, Matteo
    McInerney, Daniel
    Py, Nicolas
    Tomppo, Erkki O.
    Waser, Lars T.
    Travaglini, Davide
    McRoberts, Ronald E.
    REMOTE SENSING OF ENVIRONMENT, 2016, 176 : 282 - 294