SELFormer: molecular representation learning via SELFIES language models

Cited by: 24
Authors
Yuksel, Atakan [1 ]
Ulusoy, Erva [1 ,2 ]
Unlu, Atabey [1 ,2 ]
Dogan, Tunca [1 ,2 ,3 ]
Affiliations
[1] Hacettepe Univ, Dept Comp Engn, Biol Data Sci Lab, Ankara, Turkiye
[2] Hacettepe Univ, Grad Sch Hlth Sci, Dept Bioinformat, Ankara, Turkiye
[3] Hacettepe Univ, Inst Informat, Ankara, Turkiye
Source
Machine Learning: Science and Technology
Keywords
molecular representation learning; drug discovery; molecular property prediction; natural language processing; transformers; FREE TOOL; SOLUBILITY; DATABASE; CHEMISTRY
DOI
10.1088/2632-2153/acdb30
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Automated computational analysis of the vast chemical space is critical for numerous fields of research, such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data for efficient use in subsequent prediction tasks. One approach to efficiently learning molecular representations is processing string-based notations of chemicals via natural language processing algorithms. The majority of methods proposed so far utilize SMILES notation for this purpose, which is the most extensively used string-based encoding for molecules. However, SMILES is associated with numerous problems related to validity and robustness, which may prevent a model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model (CLM) that takes SELFIES, a 100% valid, compact and expressive notation, as input in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based CLMs, at predicting the aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized the molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate molecules with differing structural properties. We share SELFormer as a programmatic tool, together with its datasets and pre-trained models, at https://github.com/HUBioDataLab/SELFormer. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
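To make the workflow described in the abstract concrete, below is a minimal Python sketch (not the authors' released code) of the core idea: converting a SMILES string into the always-decodable SELFIES notation with the `selfies` package, then encoding it with a Hugging Face transformer to obtain a molecular embedding. The checkpoint name "HUBioDataLab/SELFormer" is hypothetical; the actual pre-trained models and tokenizers are distributed via the repository linked above.

```python
# Minimal sketch of SELFIES-based molecular embedding; not the authors' pipeline.
# Assumes the `selfies` and `transformers` packages; the checkpoint name below is
# hypothetical -- use the models released at https://github.com/HUBioDataLab/SELFormer.
import selfies as sf
from transformers import AutoModel, AutoTokenizer

smiles = "CC(=O)Oc1ccccc1C(=O)O"   # aspirin, written as SMILES
selfies_str = sf.encoder(smiles)    # SMILES -> SELFIES (always decodable to a valid molecule)
print(selfies_str)

tokenizer = AutoTokenizer.from_pretrained("HUBioDataLab/SELFormer")  # hypothetical name
model = AutoModel.from_pretrained("HUBioDataLab/SELFormer")          # hypothetical name

inputs = tokenizer(selfies_str, return_tensors="pt")
outputs = model(**inputs)
embedding = outputs.last_hidden_state.mean(dim=1)  # mean-pool token states -> one vector per molecule
print(embedding.shape)
```

Such a fixed-length embedding could then be passed to a lightweight classifier or regressor, mirroring the fine-tuning setup for property prediction tasks (e.g. aqueous solubility) described in the abstract.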
Pages: 20
相关论文
共 50 条
  • [31] Cross-language Citation Recommendation via Hierarchical Representation Learning on Heterogeneous Graph
    Jiang, Zhuoren
    Yin, Yue
    Gao, Liangcai
    Lu, Yao
    Liu, Xiaozhong
    ACM/SIGIR PROCEEDINGS 2018, 2018, : 635 - 644
  • [32] AdaMGT: Molecular representation learning via adaptive mixture of GCN-Transformer
    Ding, Cangfeng
    Yan, Zhaoyao
    Ma, Lerong
    Cao, Bohao
    Cao, Lu
    KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [33] Molecular representation contrastive learning via transformer embedding to graph neural networks
    Liu, Yunwu
    Zhang, Ruisheng
    Li, Tongfeng
    Jiang, Jing
    Ma, Jun
    Yuan, Yongna
    Wang, Ping
    APPLIED SOFT COMPUTING, 2024, 164
  • [34] Molecular set representation learning
    Boulougouri, Maria
    Vandergheynst, Pierre
    Probst, Daniel
    NATURE MACHINE INTELLIGENCE, 2024, 6 (07) : 754 - 763
  • [35] Integrating Reinforcement Learning with Models of Representation Learning
    Jones, Matt
    Canas, Fabian
    COGNITION IN FLUX, 2010, : 1258 - 1263
  • [36] Data representation learning via dictionary learning and self-representation
    Deyu Zeng
    Jing Sun
    Zongze Wu
    Chris Ding
    Zhigang Ren
    Applied Intelligence, 2023, 53 : 26988 - 27000
  • [37] Data representation learning via dictionary learning and self-representation
    Zeng, Deyu
    Su, Jing
    Wu, Zongze
    Ding, Chris
    Ren, Zhigang
    APPLIED INTELLIGENCE, 2023, 53 (22) : 26988 - 27000
  • [38] Scalable Rule Learning via Learning Representation
    Omran, Pouya Ghiasnezhad
    Wang, Kewen
    Wang, Zhe
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 2149 - 2155
  • [39] Can language representation models think in bets?
    Tang, Zhisheng
    Kejriwal, Mayank
    ROYAL SOCIETY OPEN SCIENCE, 2023, 10 (03):
  • [40] Monotonic Representation of Numeric Properties in Language Models
    Heinzerling, Benjamin
    Inui, Kentaro
    PROCEEDINGS OF THE 62ND ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS, VOL 2: SHORT PAPERS, 2024, : 175 - 195