Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers

Cited by: 7
Authors
Khan, Sumeer Ahmad [1, 2]
Maillo, Alberto [1]
Lagani, Vincenzo [1, 2, 3]
Lehmann, Robert [1]
Kiani, Narsis A. [4, 5]
Gomez-Cabrero, David [1, 6]
Tegner, Jesper [1, 5, 7, 8]
Affiliations
[1] King Abdullah Univ Sci & Technol KAUST, Biol & Environm Sci & Engn Div, Thuwal, Saudi Arabia
[2] SDAIA KAUST Ctr Excellence Data Sci & Artificial I, Thuwal, Saudi Arabia
[3] Ilia State Univ, Inst Chem Biol, Tbilisi, Georgia
[4] Karolinska Inst, Dept Oncol & Pathol, Algorithm Dynam Lab, Stockholm, Sweden
[5] Karolinska Univ Hosp, Karolinska Inst, Ctr Mol Med, Dept Med,Unit Comp Med, Stockholm, Sweden
[6] Univ Publ Navarra UPNA, IdiSNA, Navarrabiomed, Translat Bioinformat Unit, Pamplona, Spain
[7] King Abdullah Univ Sci & Technol KAUST, Comp Elect & Math Sci & Engn Div, Thuwal, Saudi Arabia
[8] Sci Life Lab, Solna, Sweden
Keywords
Cell types - Data driven - Embeddings - Genomic data - Genomics - Language processing - Machine learning algorithms - Natural languages - Pre-training - Single cells
DOI
10.1038/s42256-023-00757-8
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
The rise of single-cell genomics is an attractive opportunity for data-hungry machine learning algorithms. The scBERT method, inspired by the success of BERT ('bidirectional encoder representations from transformers') in natural language processing, was recently introduced by Yang et al. as a data-driven tool to annotate cell types in single-cell genomics data. Analogous to contextual embedding in BERT, scBERT leverages pretraining and self-attention mechanisms to learn the 'transcriptional grammar' of cells. Here we investigate the reusability beyond the original datasets, assessing the generalizability of natural language techniques in single-cell genomics. The degree of imbalance in the cell-type distribution substantially influences the performance of scBERT. Anticipating an increased utilization of transformers, we highlight the necessity to consider data distribution carefully and introduce a subsampling technique to mitigate the influence of an imbalanced distribution. Our analysis serves as a stepping stone towards understanding and optimizing the use of transformers in single-cell genomics. scBERT, a pretrained neural network for single-cell sequencing tasks, was published last year in Nature Machine Intelligence. To test the reusability of the method, Khan et al. use the code to assess the generalizability of transformer architectures on single-cell genomics tasks.
Pages: 1437-1446
Page count: 13