SELFormer: molecular representation learning via SELFIES language models

Cited by: 24
Authors
Yuksel, Atakan [1 ]
Ulusoy, Erva [1 ,2 ]
Unlu, Atabey [1 ,2 ]
Dogan, Tunca [1 ,2 ,3 ]
Affiliations
[1] Hacettepe Univ, Dept Comp Engn, Biol Data Sci Lab, Ankara, Turkiye
[2] Hacettepe Univ, Grad Sch Hlth Sci, Dept Bioinformat, Ankara, Turkiye
[3] Hacettepe Univ, Inst Informat, Ankara, Turkiye
Keywords
molecular representation learning; drug discovery; molecular property prediction; natural language processing; transformers; FREE TOOL; SOLUBILITY; DATABASE; CHEMISTRY;
DOI
10.1088/2632-2153/acdb30
Chinese Library Classification (CLC): TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Automated computational analysis of the vast chemical space is critical for numerous fields of research, such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data for efficient use in subsequent prediction tasks. One approach to efficiently learning molecular representations is processing string-based notations of chemicals via natural language processing algorithms. The majority of the methods proposed so far utilize SMILES notations for this purpose, which is the most extensively used string-based encoding for molecules. However, SMILES is associated with numerous problems related to validity and robustness, which may prevent the model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model (CLM) that utilizes a 100% valid, compact and expressive notation, SELFIES, as input, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation has revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based CLMs, on predicting aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We share SELFormer as a programmatic tool, together with its datasets and pre-trained models, at https://github.com/HUBioDataLab/SELFormer. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
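The abstract describes encoding molecules as SELFIES strings and embedding them with a transformer-based chemical language model. The following is a minimal sketch of that pipeline, assuming the selfies Python package and a Hugging Face Transformers-compatible SELFormer checkpoint downloaded locally from the repository linked above; the checkpoint path, the example molecule, and the mean-pooling step are illustrative assumptions, not details taken from the paper.

    # Minimal sketch: SMILES -> SELFIES -> SELFormer embedding.
    # The checkpoint directory below is a placeholder; obtain the pre-trained
    # model from https://github.com/HUBioDataLab/SELFormer and adjust the path.
    import selfies as sf                        # pip install selfies
    import torch
    from transformers import AutoTokenizer, AutoModel

    smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"         # aspirin, used only as an example
    selfies_string = sf.encoder(smiles)         # SELFIES strings are valid by construction

    checkpoint = "path/to/selformer_checkpoint" # placeholder local directory
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModel.from_pretrained(checkpoint)

    with torch.no_grad():
        tokens = tokenizer(selfies_string, return_tensors="pt")
        hidden = model(**tokens).last_hidden_state   # shape: (1, seq_len, hidden_dim)

    # Mean-pool token embeddings into one fixed-length molecular representation;
    # the pooling choice here is an assumption, not prescribed by the paper.
    embedding = hidden.mean(dim=1).squeeze(0)
    print(embedding.shape)

A vector obtained this way could then serve as input to a downstream property-prediction model, mirroring the pre-train/fine-tune workflow the abstract outlines.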
Pages: 20
Related papers (50 in total; items [21]-[30] shown)
  • [21] Yang, Shuwen; Li, Ziyao; Song, Guojie; Cai, Lingsheng. Deep Molecular Representation Learning via Fusing Physical and Chemical Information. Advances in Neural Information Processing Systems 34 (NeurIPS 2021), 2021, 34.
  • [22] Feltgen, Quentin; Fagard, Benjamin; Nadal, Jean-Pierre. Representation of the language and models of language change: grammaticalization as perspective. Traitement Automatique des Langues, 2014, 55(3): 47-71.
  • [23] Yang, Zekun; Lv, Kun; Shu, Jian; Li, Zheng; Xiao, Ping. Incorporating Molecular Knowledge in Large Language Models via Multimodal Modeling. IEEE Transactions on Computational Social Systems, 2025.
  • [24] Xuan, Shiyu; Yang, Ming; Zhang, Shiliang. Adapting Vision-Language Models via Learning to Inject Knowledge. IEEE Transactions on Image Processing, 2024, 33: 5798-5809.
  • [25] Qiu, Zhangchi; Tao, Ye; Pan, Shirui; Liew, Alan Wee-Chung. Knowledge Graphs and Pretrained Language Models Enhanced Representation Learning for Conversational Recommender Systems. IEEE Transactions on Neural Networks and Learning Systems, 2024: 1-15.
  • [26] Bian, Yuxuan; Ju, Xuan; Li, Jiangtong; Xu, Zhijian; Cheng, Dawei; Xu, Qiang. Multi-Patch Prediction: Adapting Language Models for Time Series Representation Learning. International Conference on Machine Learning, 2024, 235.
  • [27] Ye, Deming; Lin, Yankai; Du, Jiaju; Liu, Zhenghao; Li, Peng; Sun, Maosong; Liu, Zhiyuan. Coreferential Reasoning Learning for Language Representation. Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), 2020: 7170-7186.
  • [28] Liu, Zhiyuan; Lin, Yankai; Sun, Maosong. Representation Learning for Natural Language Processing. Journal of Chinese Information Processing, 2021, (03): 143.
  • [29] Xiong, Wei; Lu, Zhihui; Li, Bing; Hang, Bo; Wu, Zhao. Automating smart recommendation from natural language API descriptions via representation learning. Future Generation Computer Systems: The International Journal of eScience, 2018, 87: 382-391.
  • [30] Sun, Jianguo; Jia, Yifan; Wang, Yanbin; Tian, Ye; Zhang, Sheng. Ethereum fraud detection via joint transaction language model and graph representation learning. Information Fusion, 2025, 120.