Approximate distance correlation for selecting highly interrelated genes across datasets

Cited by: 3
Authors
Shen, Qunlun [1, 2]
Zhang, Shihua [1, 2, 3, 4]
Affiliations
[1] Chinese Acad Sci, Acad Math & Syst Sci, RCSDS, CEMS, NCMIS, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Math Sci, Beijing, Peoples R China
[3] Chinese Acad Sci, Ctr Excellence Anim Evolut & Genet, Kunming, Yunnan, Peoples R China
[4] Chinese Acad Sci, Univ Chinese Acad Sci, Hangzhou Inst Adv Study, Key Lab Syst Biol, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
CELL RNA-SEQ; EXPRESSION; PREDICTION; DISCOVERY; CANCER; ATLAS;
DOI
10.1371/journal.pcbi.1009548
Chinese Library Classification number
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
With the rapid accumulation of biological omics datasets, decoding the underlying relationships of genes across datasets has become an important issue. Previous studies have attempted to identify differentially expressed genes across datasets, but these methods struggle to detect interrelated genes. Moreover, existing correlation-based algorithms can only measure the relationship between genes within a single dataset or between two multi-modal datasets from the same samples. It is still unclear how to quantify the strength of association of the same gene across two biological datasets with different samples. To this end, we propose Approximate Distance Correlation (ADC) to select interrelated genes with statistical significance across two different biological datasets. ADC first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across the two datasets. ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. We demonstrate the effectiveness of ADC with simulation data and four real applications to select highly interrelated genes across two datasets. These four applications include 21 cancer RNA-seq datasets of different tissues; six single-cell RNA-seq (scRNA-seq) datasets of mouse hematopoietic cells across six different cell types along the hematopoietic cell lineage; five scRNA-seq datasets of pancreatic islet cells across five different technologies; and coupled single-cell ATAC-seq (scATAC-seq) and scRNA-seq data of peripheral blood mononuclear cells (PBMC). Extensive results demonstrate that ADC is a powerful tool to uncover interrelated genes with strong biological implications and is scalable to large-scale datasets. Moreover, the number of such genes can serve as a metric to measure the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.
Author summary
The number and size of biological datasets (e.g., single-cell RNA-seq datasets) have grown rapidly in recent years. How to mine the relationships of genes across datasets is becoming an important issue. Computational tools for identifying differentially expressed genes have been comprehensively studied, but interrelated genes across datasets have largely been neglected. Detecting highly interrelated genes across datasets is difficult because the datasets usually contain different samples and may contain different numbers of samples. To solve this problem, we present a new algorithm that can identify interrelated genes across datasets based on distance correlation. Our proposed algorithm is efficient and works well with different technologies, i.e., RNA-seq, single-cell RNA-seq and single-cell ATAC-seq. We also found that the number of such highly interrelated genes can serve as a metric to measure the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.
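The two statistical building blocks named in the abstract, distance correlation and the Benjamini-Hochberg adjustment, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `distance_correlation` and `benjamini_hochberg` are hypothetical, and the sketch assumes the two inputs already hold the same number of paired observations. How ADC builds those approximate observations from the k most correlated genes, and how it derives p-values, is not specified in this record and is left out.

```python
import numpy as np

def _centered_distances(a):
    """Pairwise Euclidean distance matrix of the observations, double-centered."""
    a = np.asarray(a, dtype=float)
    if a.ndim == 1:
        a = a[:, None]  # treat a vector as n one-dimensional observations
    d = np.sqrt(((a[:, None, :] - a[None, :, :]) ** 2).sum(axis=-1))
    return d - d.mean(axis=0, keepdims=True) - d.mean(axis=1, keepdims=True) + d.mean()

def distance_correlation(x, y):
    """Sample distance correlation between two sets of n paired observations."""
    A = _centered_distances(x)
    B = _centered_distances(y)
    dcov2 = (A * B).mean()                           # squared distance covariance
    denom = np.sqrt((A * A).mean() * (B * B).mean())  # product of distance std devs
    if denom == 0:
        return 0.0  # one input is constant, so no dependence can be measured
    return float(np.sqrt(max(dcov2, 0.0) / denom))   # clamp tiny negative round-off

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvals, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # enforce monotonicity, working back from the largest p-value
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(scaled, 0.0, 1.0)
    return adjusted
```

In a pipeline of the kind the abstract describes, one would compute a distance correlation and a p-value per gene, pass all p-values through `benjamini_hochberg`, and keep the genes whose adjusted value falls below a chosen false discovery rate threshold.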
Pages: 18