Are genomic language models all you need? Exploring genomic language models on protein downstream tasks

Cited by: 0
Authors
Boshar, Sam [1 ]
Trop, Evan [1 ]
de Almeida, Bernardo P. [2 ]
Copoiu, Liviu [3 ]
Pierrot, Thomas [1 ]
Affiliations
[1] InstaDeep, Cambridge, MA 02142 USA
[2] InstaDeep, Paris, France
[3] InstaDeep, London W2 1AY, England
Keywords
PREDICTION
DOI
10.1093/bioinformatics/btae529
Chinese Library Classification
Q5 [Biochemistry]
Subject classification codes
071010; 081704
Abstract
Motivation: Large language models, trained on enormous corpora of biological sequences, are state-of-the-art for downstream genomic and proteomic tasks. Since the genome contains the information to encode all proteins, genomic language models (gLMs) hold the potential to make downstream predictions not only about DNA sequences but also about proteins. However, the performance of gLMs on protein tasks remains unknown, because few tasks pair proteins with the coding DNA sequences (CDS) that gLMs can process.
Results: In this work, we curated five such datasets and used them to evaluate the performance of gLMs and proteomic language models (pLMs). We show that gLMs are competitive with, and on some tasks even outperform, their pLM counterparts. The best performance was achieved using the retrieved CDS rather than sampling strategies. We found that training a joint genomic-proteomic model outperforms each individual approach, showing that the two capture different but complementary sequence representations, as we demonstrate through model interpretation of their embeddings. Lastly, we explored different genomic tokenization schemes to improve downstream protein performance. We trained a new Nucleotide Transformer (50M) foundation model with 3mer tokenization that outperforms its 6mer counterpart on protein tasks while maintaining performance on genomics tasks. The application of gLMs to proteomics offers the potential to leverage rich CDS data and, in the spirit of the central dogma, the possibility of a unified and synergistic approach to genomics and proteomics.
Availability and implementation: We make our inference code, 3mer pre-trained model weights, and datasets available.
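The abstract contrasts 3mer and 6mer tokenization for genomic language models. As an illustrative sketch only (not the paper's implementation; the function name and the handling of a short trailing remainder are assumptions), non-overlapping k-mer tokenization of a CDS can be written as:

```python
def kmer_tokenize(cds: str, k: int) -> list[str]:
    """Split a DNA sequence into non-overlapping k-mer tokens.

    A trailing remainder shorter than k is kept as one token here;
    real gLM tokenizers may handle such remainders differently.
    """
    return [cds[i:i + k] for i in range(0, len(cds), k)]

cds = "ATGGCCAAGTAA"  # toy CDS: start codon, two codons, stop codon
print(kmer_tokenize(cds, 3))  # ['ATG', 'GCC', 'AAG', 'TAA']
print(kmer_tokenize(cds, 6))  # ['ATGGCC', 'AAGTAA']
```

Note that 3mer tokens coincide with codon boundaries while 6mer tokens span two codons each, one plausible reason a 3mer vocabulary could be better suited to protein-level tasks.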
Pages: 15