Are genomic language models all you need? Exploring genomic language models on protein downstream tasks

Cited by: 0
Authors
Boshar, Sam [1 ]
Trop, Evan [1 ]
de Almeida, Bernardo P. [2 ]
Copoiu, Liviu [3 ]
Pierrot, Thomas [1 ]
Affiliations
[1] InstaDeep, Cambridge, MA 02142 USA
[2] InstaDeep, Paris, France
[3] InstaDeep, London W2 1AY, England
Keywords
PREDICTION
DOI
10.1093/bioinformatics/btae529
Chinese Library Classification (CLC)
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences, but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.
Results: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts. The best performance was achieved using the retrieved CDS rather than sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that they capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data and, in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.
Availability and implementation: We make our inference code, 3mer pre-trained model weights, and datasets available.
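The abstract contrasts 3mer and 6mer genomic tokenization of coding sequences. The sketch below is illustrative only and is not code from the paper: the toy CDS, the function name, and the handling of a trailing remainder are assumptions. It shows non-overlapping k-mer tokenization and why 3mer tokens align with codon boundaries while 6mer tokens couple adjacent codons into a single token.

# Minimal sketch (assumed, not the authors' implementation):
# non-overlapping k-mer tokenization of a coding DNA sequence (CDS).

def kmer_tokenize(cds: str, k: int) -> list[str]:
    """Split a CDS into non-overlapping k-mers.

    A trailing remainder shorter than k is kept as its own token
    (one simple convention, assumed here for illustration).
    """
    cds = cds.upper()
    return [cds[i:i + k] for i in range(0, len(cds), k)]

if __name__ == "__main__":
    # Toy CDS: ATG (start) + four codons + TAA (stop), 18 nt in total.
    cds = "ATGGCTGGCAAATTGTAA"

    # 3mer tokens align with codon boundaries, so each token is one codon.
    print(kmer_tokenize(cds, 3))   # ['ATG', 'GCT', 'GGC', 'AAA', 'TTG', 'TAA']

    # 6mer tokens span two codons each, halving sequence length but
    # merging adjacent codons inside a single token.
    print(kmer_tokenize(cds, 6))   # ['ATGGCT', 'GGCAAA', 'TTGTAA']

For a CDS (which always starts in frame at the ATG), 3mer tokenization therefore yields a one-to-one correspondence between tokens and codons, which is one plausible reading of why a codon-aligned vocabulary could help on protein downstream tasks.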
Pages: 15
Related articles
50 records in total
  • [41] Exploring Large Language Models for Classical Philology
    Riemenschneider, Frederick
    Frank, Anette
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15181 - 15199
  • [42] Exploring Mathematical Conjecturing with Large Language Models
    Johansson, Moa
    Smallbone, Nicholas
    NEURAL-SYMBOLIC LEARNING AND REASONING 2023, NESY 2023, 2023,
  • [43] Exploring Length Generalization in Large Language Models
    Anil, Cem
    Wu, Yuhuai
    Andreassen, Anders
    Lewkowycz, Aitor
    Misra, Vedant
    Ramasesh, Vinay
    Slone, Ambrose
    Gur-Ari, Guy
    Dyer, Ethan
    Neyshabur, Behnam
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [44] Exploring evolution-aware & -free protein language models as protein function predictors
    Hu, Mingyang
    Yuan, Fajie
    Yang, Kevin K.
    Ju, Fusong
    Su, Jin
    Wang, Hui
    Yang, Fei
    Ding, Qiuyang
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 35 (NEURIPS 2022), 2022,
  • [45] Benchmarking protein language models for protein crystallization
    Mall, Raghvendra
    Kaushik, Rahul
    Martinez, Zachary A.
    Thomson, Matt W.
    Castiglione, Filippo
    SCIENTIFIC REPORTS, 2025, 15 (01):
  • [46] Controllable protein design with language models
    Noelia Ferruz
    Birte Höcker
    Nature Machine Intelligence, 2022, 4 : 521 - 532
  • [47] Protein language models using convolutions
    Tang, Lin
    NATURE METHODS, 2024, 21 (04) : 550 - 550
  • [48] Controllable protein design with language models
    Ferruz, Noelia
    Hoecker, Birte
    NATURE MACHINE INTELLIGENCE, 2022, 4 (06) : 521 - 532
  • [49] Fine-tuning protein language models boosts predictions across diverse tasks
    Schmirler, Robert
    Heinzinger, Michael
    Rost, Burkhard
    NATURE COMMUNICATIONS, 2024, 15 (01)
  • [50] Sequence-Structure Embeddings via Protein Language Models Improve on Prediction Tasks
    Kabir, Anowarul
    Shehu, Amarda
    2022 IEEE INTERNATIONAL CONFERENCE ON KNOWLEDGE GRAPH (ICKG), 2022, : 105 - 112