Approximate distance correlation for selecting highly interrelated genes across datasets

Cited by: 3
Authors
Shen, Qunlun [1, 2]
Zhang, Shihua [1, 2, 3, 4]
Affiliations
[1] Chinese Acad Sci, Acad Math & Syst Sci, RCSDS, CEMS,NCMIS, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Math Sci, Beijing, Peoples R China
[3] Chinese Acad Sci, Ctr Excellence Anim Evolut & Genet, Kunming, Yunnan, Peoples R China
[4] Chinese Acad Sci, Univ Chinese Acad Sci, Hangzhou Inst Adv Study, Key Lab Syst Biol, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
CELL RNA-SEQ; EXPRESSION; PREDICTION; DISCOVERY; CANCER; ATLAS;
DOI
10.1371/journal.pcbi.1009548
Chinese Library Classification number
Q5 [Biochemistry]
Discipline classification codes
071010; 081704
Abstract
With the rapid accumulation of biological omics datasets, decoding the underlying relationships of genes across datasets has become an important problem. Previous studies have attempted to identify differentially expressed genes across datasets, but these methods struggle to detect interrelated genes. Moreover, existing correlation-based algorithms can only measure the relationship between genes within a single dataset or between two multi-modal datasets from the same samples. It is still unclear how to quantify the strength of association of the same gene across two biological datasets with different samples. To this end, we propose Approximate Distance Correlation (ADC) to select interrelated genes with statistical significance across two different biological datasets. ADC first obtains the k most correlated genes for each target gene as its approximate observations, and then calculates the distance correlation (DC) for the target gene across the two datasets. ADC repeats this process for all genes and then performs the Benjamini-Hochberg adjustment to control the false discovery rate. We demonstrate the effectiveness of ADC with simulation data and four real applications to select highly interrelated genes across two datasets. These four applications include 21 cancer RNA-seq datasets of different tissues; six single-cell RNA-seq (scRNA-seq) datasets of mouse hematopoietic cells across six different cell types along the hematopoietic cell lineage; five scRNA-seq datasets of pancreatic islet cells across five different technologies; and coupled single-cell ATAC-seq (scATAC-seq) and scRNA-seq data of peripheral blood mononuclear cells (PBMC). Extensive results demonstrate that ADC is a powerful tool for uncovering interrelated genes with strong biological implications and that it scales to large datasets. Moreover, the number of such genes can serve as a metric of the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.

Author summary
The number and size of biological datasets (e.g., single-cell RNA-seq datasets) have grown rapidly in recent years. How to mine the relationships of genes across datasets is becoming an important issue. Computational tools for identifying differentially expressed genes have been comprehensively studied, but interrelated genes across datasets have largely been neglected. Detecting highly interrelated genes across datasets is difficult because the datasets contain different samples and may differ in the number of samples. To solve this problem, we present a new algorithm that can identify interrelated genes across datasets based on distance correlation. Our proposed algorithm is very efficient and works well with different technologies, namely RNA-seq, single-cell RNA-seq, and single-cell ATAC-seq. We also found that the number of such highly interrelated genes can serve as a metric of the similarity between two datasets, which could characterize the relative difference of diverse cell types and technologies.
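The abstract names three building blocks: picking the k most correlated genes for a target gene, computing a distance correlation, and applying the Benjamini-Hochberg adjustment. The Python sketch below illustrates each of them with NumPy. It is not the authors' implementation. The function names are hypothetical, the use of absolute Pearson correlation to rank genes is an assumption, and `illustrative_cross_dataset_dc` in particular encodes one guessed way of turning the k neighbours into paired "approximate observations", which the abstract does not specify. How ADC derives p-values is also not stated in the abstract, so the adjustment function simply takes p-values as input.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation of two paired samples with equal length."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x = x.reshape(x.shape[0], -1)
    y = y.reshape(y.shape[0], -1)
    if x.shape[0] != y.shape[0]:
        raise ValueError("x and y must have the same number of observations")

    def centered(z):
        # Pairwise Euclidean distances followed by double centering.
        d = np.linalg.norm(z[:, None, :] - z[None, :, :], axis=-1)
        return (d - d.mean(axis=0, keepdims=True)
                - d.mean(axis=1, keepdims=True) + d.mean())

    a, b = centered(x), centered(y)
    dcov2 = (a * b).mean()
    denom = np.sqrt((a * a).mean() * (b * b).mean())
    if denom == 0:
        return 0.0
    # max() guards against tiny negative values caused by rounding.
    return float(np.sqrt(max(dcov2, 0.0) / denom))


def top_k_correlated(expr, target, k):
    """Indices of the k genes with the largest |Pearson r| to gene `target`.

    `expr` is a genes x samples matrix. Ranking by absolute Pearson
    correlation is an assumption made for this sketch.
    """
    score = np.abs(np.corrcoef(expr)[target])
    score = np.nan_to_num(score, nan=-1.0)  # constant genes yield NaN
    score[target] = -np.inf                 # never select the target itself
    return np.argsort(score)[::-1][:k]


def illustrative_cross_dataset_dc(expr_a, expr_b, target, k):
    """ASSUMPTION, not the published procedure: use the target gene's
    correlations with its k neighbours (chosen in dataset A) as k paired
    observations, one vector per dataset, and compare them with DC.
    Both matrices must list the same genes in the same row order."""
    neighbours = top_k_correlated(expr_a, target, k)
    obs_a = np.corrcoef(expr_a)[target, neighbours]
    obs_b = np.corrcoef(expr_b)[target, neighbours]
    return distance_correlation(obs_a, obs_b)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value.
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(adjusted_sorted, 0.0, 1.0)
    return adjusted
```

In this sketch the two datasets may have different numbers of samples, because the comparison runs over the k neighbour genes rather than over samples. Genes whose adjusted p-value falls below a chosen threshold would then be reported as interrelated; the threshold and the test that produces the p-values are left open here.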
Pages: 18