Are genomic language models all you need? Exploring genomic language models on protein downstream tasks

Cited: 0
Authors
Boshar, Sam [1 ]
Trop, Evan [1 ]
de Almeida, Bernardo P. [2 ]
Copoiu, Liviu [3 ]
Pierrot, Thomas [1 ]
Affiliations
[1] InstaDeep, Cambridge, MA 02142 USA
[2] InstaDeep, Paris, France
[3] InstaDeep, London W2 1AY, England
Keywords
PREDICTION
DOI
10.1093/bioinformatics/btae529
CLC classification
Q5 [Biochemistry]
Discipline codes
071010; 081704
Abstract
Motivation: Large language models trained on enormous corpora of biological sequences are state of the art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few existing tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.
Results: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts. The best performance was achieved using the retrieved CDS rather than sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that the two capture different but complementary sequence representations, as we demonstrate through interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomic tasks. Applying gLMs to proteomics offers the potential to leverage rich CDS data and, in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.
Availability and implementation: We make our inference code, 3mer pre-trained model weights, and datasets available.
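To make the tokenization comparison concrete, the following is a minimal, illustrative Python sketch; the simplified tokenizer, the truncated codon table, and the toy 18-nt CDS are assumptions made for illustration, not the authors' released inference code or the actual Nucleotide Transformer vocabulary. It pairs a hypothetical CDS with its translated protein and contrasts non-overlapping 3mer tokenization, where each token aligns with exactly one codon, against 6mer tokenization, where each token spans two codons.

# Illustrative sketch only: pairing a CDS with its protein, and contrasting
# non-overlapping 3mer vs 6mer tokenization of the same CDS. The codon table
# is truncated to the codons in this toy example (a full table has 64 entries).
CODON_TABLE = {
    "ATG": "M", "GCT": "A", "AAA": "K", "GGC": "G", "TTC": "F", "TAA": "*",
}

def translate_cds(cds: str) -> str:
    """Translate a CDS codon by codon; the trailing stop symbol '*' is dropped."""
    return "".join(CODON_TABLE[cds[i:i + 3]] for i in range(0, len(cds), 3)).rstrip("*")

def kmer_tokenize(seq: str, k: int) -> list[str]:
    """Split a sequence into non-overlapping k-mer tokens."""
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, k)]

cds = "ATGGCTAAAGGCTTCTAA"     # hypothetical 18-nt CDS (6 codons incl. stop)
print(translate_cds(cds))      # MAKGF -- the paired protein sequence
print(kmer_tokenize(cds, 3))   # ['ATG', 'GCT', 'AAA', 'GGC', 'TTC', 'TAA']: one token per codon
print(kmer_tokenize(cds, 6))   # ['ATGGCT', 'AAAGGC', 'TTCTAA']: each token spans two codons

One plausible reading of the reported 3mer-vs-6mer result is that codon-aligned 3mer token boundaries coincide with amino acid boundaries, making protein-level signal easier to recover from CDS inputs, whereas each 6mer token mixes two adjacent codons.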
Pages: 15
Related papers
50 results
  • [21] ALERT: Adapting Language Models to Reasoning Tasks
    Yu, Ping
    Wang, Tianlu
    Golovneva, Olga
    AlKhamissi, Badr
    Verma, Siddharth
    Jin, Zhijing
    Ghosh, Gargi
    Diab, Mona
    Celikyilmaz, Asli
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, ACL 2023, VOL 1, 2023, : 1055 - 1081
  • [22] Exploring Dialogism Using Language Models
    Ruseti, Stefan
    Dascalu, Maria-Dorinela
    Corlatescu, Dragos-Georgian
    Dascalu, Mihai
    Trausan-Matu, Stefan
    McNamara, Danielle S.
    ARTIFICIAL INTELLIGENCE IN EDUCATION (AIED 2021), PT II, 2021, 12749 : 296 - 301
  • [23] Language Models can Solve Computer Tasks
    Kim, Geunwoo
    Baldi, Pierre
    McAleer, Stephen
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023
  • [24] On Grounded Planning for Embodied Tasks with Language Models
    Lin, Bill Yuchen
    Huang, Chengsong
    Liu, Qian
    Gu, Wenda
    Sommerer, Sam
    Ren, Xiang
    THIRTY-SEVENTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, VOL 37 NO 11, 2023, : 13192 - 13200
  • [25] Robustness of GPT Large Language Models on Natural Language Processing Tasks
    Xuanting C.
    Junjie Y.
    Can Z.
    Nuo X.
    Tao G.
    Qi Z.
    Jisuanji Yanjiu yu Fazhan/Computer Research and Development, 2024, 61 (05): : 1128 - 1142
  • [26] Language models for protein design
    Lee, Jin Sub
    Abdin, Osama
    Kim, Philip M.
    CURRENT OPINION IN STRUCTURAL BIOLOGY, 2025, 92
  • [27] Exploring the potential of large language models for author profiling tasks in digital text forensics
    Cho, Sang-Hyun
    Kim, Dohyun
    Kwon, Hyuk-Chul
    Kim, Minho
    FORENSIC SCIENCE INTERNATIONAL-DIGITAL INVESTIGATION, 2024, 50
  • [28] Direction is what you need: Improving Word Embedding Compression in Large Language Models
    Balazy, Klaudia
    Banaei, Mohammadreza
    Lebret, Remi
    Tabor, Jacek
    Aberer, Karl
    REPL4NLP 2021: PROCEEDINGS OF THE 6TH WORKSHOP ON REPRESENTATION LEARNING FOR NLP, 2021, : 322 - 330
  • [29] Artificial intelligence, large language models, and you
    Marquardt, Charles
    JOURNAL OF VASCULAR SURGERY CASES INNOVATIONS AND TECHNIQUES, 2023, 9 (04)
  • [30] Large Language Models Need Symbolic AI
    Hammond, Kristian
    Leake, David
    NEURAL-SYMBOLIC LEARNING AND REASONING 2023, NESY 2023, 2023