Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers

Cited by: 7
Authors
Khan, Sumeer Ahmad [1, 2]
Maillo, Alberto [1]
Lagani, Vincenzo [1, 2, 3]
Lehmann, Robert [1]
Kiani, Narsis A. [4, 5]
Gomez-Cabrero, David [1, 6]
Tegner, Jesper [1, 5, 7, 8]
Affiliations
[1] King Abdullah Univ Sci & Technol KAUST, Biol & Environm Sci & Engn Div, Thuwal, Saudi Arabia
[2] SDAIA KAUST Ctr Excellence Data Sci & Artificial I, Thuwal, Saudi Arabia
[3] Ilia State Univ, Inst Chem Biol, Tbilisi, Georgia
[4] Karolinska Inst, Dept Oncol & Pathol, Algorithm Dynam Lab, Stockholm, Sweden
[5] Karolinska Univ Hosp, Karolinska Inst, Ctr Mol Med, Dept Med, Unit Comp Med, Stockholm, Sweden
[6] Univ Publ Navarra UPNA, IdiSNA, Navarrabiomed, Translat Bioinformat Unit, Pamplona, Spain
[7] King Abdullah Univ Sci & Technol KAUST, Comp Elect & Math Sci & Engn Div, Thuwal, Saudi Arabia
[8] Sci Life Lab, Solna, Sweden
Keywords
Cell types - Data driven - Embeddings - Genomic data - Genomics - Language processing - Machine learning algorithms - Natural languages - Pre-training - Single cells
DOI
10.1038/s42256-023-00757-8
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The rise of single-cell genomics is an attractive opportunity for data-hungry machine learning algorithms. The scBERT method, inspired by the success of BERT ('bidirectional encoder representations from transformers') in natural language processing, was recently introduced by Yang et al. as a data-driven tool to annotate cell types in single-cell genomics data. Analogous to contextual embedding in BERT, scBERT leverages pretraining and self-attention mechanisms to learn the 'transcriptional grammar' of cells. Here we investigate the reusability of scBERT beyond the original datasets, assessing the generalizability of natural language techniques in single-cell genomics. The degree of imbalance in the cell-type distribution substantially influences the performance of scBERT. Anticipating an increased utilization of transformers, we highlight the necessity to consider data distribution carefully and introduce a subsampling technique to mitigate the influence of an imbalanced distribution. Our analysis serves as a stepping stone towards understanding and optimizing the use of transformers in single-cell genomics.

scBERT, a pretrained neural network for single-cell sequencing tasks, was published last year in Nature Machine Intelligence. To test the reusability of the method, Khan et al. use the code to assess the generalizability of transformer architectures on single-cell genomics tasks.
Pages: 1437-1446
Number of pages: 13