SELFormer: molecular representation learning via SELFIES language models

Citations: 24
Authors
Yuksel, Atakan [1 ]
Ulusoy, Erva [1 ,2 ]
Unlu, Atabey [1 ,2 ]
Dogan, Tunca [1 ,2 ,3 ]
Affiliations
[1] Hacettepe Univ, Dept Comp Engn, Biol Data Sci Lab, Ankara, Türkiye
[2] Hacettepe Univ, Grad Sch Hlth Sci, Dept Bioinformat, Ankara, Türkiye
[3] Hacettepe Univ, Inst Informat, Ankara, Türkiye
Source
MACHINE LEARNING-SCIENCE AND TECHNOLOGY
Keywords
molecular representation learning; drug discovery; molecular property prediction; natural language processing; transformers; FREE TOOL; SOLUBILITY; DATABASE; CHEMISTRY;
DOI
10.1088/2632-2153/acdb30
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Automated computational analysis of the vast chemical space is critical for numerous fields of research, such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data for efficient use in subsequent prediction tasks. One approach to efficiently learning molecular representations is processing string-based notations of chemicals via natural language processing algorithms. The majority of the methods proposed so far utilize SMILES notation for this purpose, which is the most extensively used string-based encoding for molecules. However, SMILES is associated with numerous problems related to validity and robustness, which may prevent a model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model (CLM) that takes as input SELFIES, a 100% valid, compact, and expressive notation, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based CLMs, at predicting the aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized the molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate between molecules with differing structural properties. We share SELFormer as a programmatic tool, together with its datasets and pre-trained models, at https://github.com/HUBioDataLab/SELFormer. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
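
The pipeline the abstract describes (SELFIES encoding followed by transformer-based embedding) can be sketched in a few lines of Python. The sketch below is a minimal illustration under stated assumptions, not the authors' code: it assumes the `selfies` and `transformers` packages are installed, the checkpoint identifier "HUBioDataLab/SELFormer" is a hypothetical placeholder (the actual pre-trained models are distributed via the GitHub repository above), and mean-pooling of the last hidden states is used as one common way to obtain a fixed-size molecular embedding.

    # Minimal sketch; assumptions are flagged in comments, this is not the paper's exact code.
    import selfies as sf                         # pip install selfies
    import torch
    from transformers import AutoModel, AutoTokenizer

    # 1) Encode a SMILES string into its always-valid SELFIES equivalent.
    smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"          # aspirin
    selfies_str = sf.encoder(smiles)             # e.g. '[C][C][=Branch1]...'

    # 2) Load a pre-trained SELFormer-style encoder.
    #    NOTE: "HUBioDataLab/SELFormer" is a hypothetical checkpoint path;
    #    obtain the real model files from the GitHub repository above.
    model_name = "HUBioDataLab/SELFormer"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # 3) Tokenize the SELFIES string and mean-pool the final hidden states
    #    to obtain a fixed-size molecular representation.
    inputs = tokenizer(selfies_str, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, seq_len, dim)
    embedding = hidden.mean(dim=1).squeeze(0)        # shape (dim,)
    print(embedding.shape)

The resulting vector can be fed to a downstream property-prediction head, or projected with a dimensionality-reduction method (e.g. UMAP or t-SNE) for the kind of visualization the abstract describes.
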
Pages: 20
Related papers (50 in total; first 10 shown)
  • [1] Cheng, Austin H.; Cai, Andy; Miret, Santiago; Malkomes, Gustavo; Phielipp, Mariano; Aspuru-Guzik, Alan. Group SELFIES: a robust fragment-based molecular string representation. DIGITAL DISCOVERY, 2023, 2(3): 748-758.
  • [2] Hajiabolhassan, Hossein; Taheri, Zahra; Hojatnia, Ali; Yeganeh, Yavar Taheri. FunQG: Molecular Representation Learning via Quotient Graphs. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2023, 63(11): 3275-3287.
  • [3] Gabrysiak, Gregor; Eichler, Daniel; Hebig, Regina; Giese, Holger. Consistent Stakeholder Modifications of Formal Models via a Natural Language Representation. 2013 1ST INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE ANALYSIS IN SOFTWARE ENGINEERING (NATURALISE), 2013: 1-8.
  • [4] Paischer, Fabian; Adler, Thomas; Patil, Vihang; Bitto-Nemling, Angela; Holzleitner, Markus; Lehner, Sebastian; Eghbal-zadeh, Hamid; Hochreiter, Sepp. History Compression via Language Models in Reinforcement Learning. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022.
  • [5] Xu, Wenjun; Xia, Yingchun; Sun, Bifan; Zhao, Zihao; Tang, Lianggui; Zhou, Obo; Wang, Qingyong; Gu, Lichuan. Learning protein language contrastive models with multi-knowledge representation. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2025, 164.
  • [6] Alves, Adriana Celia. Interculturality, Representation and Identity in Contexts of Foreign Language Learning via Teletandem. TEXTO LIVRE-LINGUAGEM E TECNOLOGIA, 2018, 11(2): 18-33.
  • [7] Schomacker, Thorben; Tropmann-Frick, Marina. Language Representation Models: An Overview. ENTROPY, 2021, 23(11).
  • [8] Yao, Chengyu; Huang, Hong; Gao, Hang; Wu, Fengge; Chen, Haiming; Zhao, Junsuo. Molecular Graph Representation Learning via Structural Similarity Information. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, PT III, ECML PKDD 2024, 2024, 14943: 351-367.
  • [9] Wang, Keheng; Yin, Chuantao; Li, Rumei; Wang, Sirui; Xian, Yunsen; Rong, Wenge; Xiong, Zhang. TOCOL: improving contextual representation of pre-trained language models via token-level contrastive learning. MACHINE LEARNING, 2024, 113(7): 3999-4012.
  • [10] Nanda, Vedant; Lamba, Hemank; Agarwal, Divyansh; Arora, Megha; Sachdeva, Niharika; Kumaraguru, Ponnurangam. Stop the KillFies! Using Deep Learning Models to Identify Dangerous Selfies. COMPANION PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2018 (WWW 2018), 2018: 1341-1345.