SELFormer: molecular representation learning via SELFIES language models

Citations: 24
Authors
Yuksel, Atakan [1 ]
Ulusoy, Erva [1 ,2 ]
Unlu, Atabey [1 ,2 ]
Dogan, Tunca [1 ,2 ,3 ]
Affiliations
[1] Hacettepe Univ, Dept Comp Engn, Biol Data Sci Lab, Ankara, Türkiye
[2] Hacettepe Univ, Grad Sch Hlth Sci, Dept Bioinformat, Ankara, Türkiye
[3] Hacettepe Univ, Inst Informat, Ankara, Türkiye
Source
MACHINE LEARNING-SCIENCE AND TECHNOLOGY
Keywords
molecular representation learning; drug discovery; molecular property prediction; natural language processing; transformers; FREE TOOL; SOLUBILITY; DATABASE; CHEMISTRY;
DOI
10.1088/2632-2153/acdb30
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Automated computational analysis of the vast chemical space is critical for numerous fields of research, such as drug discovery and materials science. Representation learning techniques have recently been employed with the primary objective of generating compact and informative numerical expressions of complex data for efficient use in subsequent prediction tasks. One approach to efficiently learning molecular representations is processing string-based notations of chemicals via natural language processing algorithms. The majority of the methods proposed so far utilize SMILES notation for this purpose, which is the most extensively used string-based encoding for molecules. However, SMILES is associated with numerous problems related to validity and robustness, which may prevent a model from effectively uncovering the knowledge hidden in the data. In this study, we propose SELFormer, a transformer architecture-based chemical language model (CLM) that takes as input SELFIES, a 100% valid, compact, and expressive notation, in order to learn flexible and high-quality molecular representations. SELFormer is pre-trained on two million drug-like compounds and fine-tuned for diverse molecular property prediction tasks. Our performance evaluation revealed that SELFormer outperforms all competing methods, including graph learning-based approaches and SMILES-based CLMs, at predicting the aqueous solubility of molecules and adverse drug reactions, while producing comparable results for the remaining tasks. We also visualized the molecular representations learned by SELFormer via dimensionality reduction, which indicated that even the pre-trained model can discriminate between molecules with differing structural properties. We share SELFormer as a programmatic tool, together with its datasets and pre-trained models, at https://github.com/HUBioDataLab/SELFormer. Overall, our research demonstrates the benefit of using the SELFIES notation in the context of chemical language modeling and opens up new possibilities for the design and discovery of novel drug candidates with desired features.
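
The pipeline the abstract describes (SELFIES encoding followed by transformer-based embedding) can be sketched in a few lines of Python. The sketch below is a minimal illustration under stated assumptions, not the authors' code: it assumes the `selfies` and `transformers` packages are installed, the checkpoint identifier "HUBioDataLab/SELFormer" is a hypothetical placeholder (the actual pre-trained models are distributed via the GitHub repository above), and mean-pooling of the last hidden states is used as one common way to obtain a fixed-size molecular embedding.

    # Minimal sketch; assumptions are flagged in comments, this is not the paper's exact code.
    import selfies as sf                         # pip install selfies
    import torch
    from transformers import AutoModel, AutoTokenizer

    # 1) Encode a SMILES string into its always-valid SELFIES equivalent.
    smiles = "CC(=O)OC1=CC=CC=C1C(=O)O"          # aspirin
    selfies_str = sf.encoder(smiles)             # e.g. '[C][C][=Branch1]...'

    # 2) Load a pre-trained SELFormer-style encoder.
    #    NOTE: "HUBioDataLab/SELFormer" is a hypothetical checkpoint path;
    #    obtain the real model files from the GitHub repository above.
    model_name = "HUBioDataLab/SELFormer"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name)

    # 3) Tokenize the SELFIES string and mean-pool the final hidden states
    #    to obtain a fixed-size molecular representation.
    inputs = tokenizer(selfies_str, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # shape (1, seq_len, dim)
    embedding = hidden.mean(dim=1).squeeze(0)        # shape (dim,)
    print(embedding.shape)

The resulting vector can be fed to a downstream property-prediction head, or projected with a dimensionality-reduction method (e.g. UMAP or t-SNE) for the kind of visualization the abstract describes.
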
Pages: 20
Related papers (50 in total; first 10 shown)
  • [1] Cheng, Austin H.; Cai, Andy; Miret, Santiago; Malkomes, Gustavo; Phielipp, Mariano; Aspuru-Guzik, Alan. Group SELFIES: a robust fragment-based molecular string representation. DIGITAL DISCOVERY, 2023, 2(3): 748-758.
  • [2] Hajiabolhassan, Hossein; Taheri, Zahra; Hojatnia, Ali; Yeganeh, Yavar Taheri. FunQG: Molecular Representation Learning via Quotient Graphs. JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2023, 63(11): 3275-3287.
  • [3] Gabrysiak, Gregor; Eichler, Daniel; Hebig, Regina; Giese, Holger. Consistent Stakeholder Modifications of Formal Models via a Natural Language Representation. 2013 1ST INTERNATIONAL WORKSHOP ON NATURAL LANGUAGE ANALYSIS IN SOFTWARE ENGINEERING (NATURALISE), 2013: 1-8.
  • [4] Paischer, Fabian; Adler, Thomas; Patil, Vihang; Bitto-Nemling, Angela; Holzleitner, Markus; Lehner, Sebastian; Eghbal-zadeh, Hamid; Hochreiter, Sepp. History Compression via Language Models in Reinforcement Learning. INTERNATIONAL CONFERENCE ON MACHINE LEARNING, VOL 162, 2022.
  • [5] Xu, Wenjun; Xia, Yingchun; Sun, Bifan; Zhao, Zihao; Tang, Lianggui; Zhou, Obo; Wang, Qingyong; Gu, Lichuan. Learning protein language contrastive models with multi-knowledge representation. FUTURE GENERATION COMPUTER SYSTEMS-THE INTERNATIONAL JOURNAL OF ESCIENCE, 2025, 164.
  • [6] Alves, Adriana Celia. Interculturality, Representation and Identity in Contexts of Foreign Language Learning via Teletandem. TEXTO LIVRE-LINGUAGEM E TECNOLOGIA, 2018, 11(2): 18-33.
  • [7] Schomacker, Thorben; Tropmann-Frick, Marina. Language Representation Models: An Overview. ENTROPY, 2021, 23(11).
  • [8] Yao, Chengyu; Huang, Hong; Gao, Hang; Wu, Fengge; Chen, Haiming; Zhao, Junsuo. Molecular Graph Representation Learning via Structural Similarity Information. MACHINE LEARNING AND KNOWLEDGE DISCOVERY IN DATABASES: RESEARCH TRACK, PT III, ECML PKDD 2024, 2024, 14943: 351-367.
  • [9] Wang, Keheng; Yin, Chuantao; Li, Rumei; Wang, Sirui; Xian, Yunsen; Rong, Wenge; Xiong, Zhang. TOCOL: improving contextual representation of pre-trained language models via token-level contrastive learning. MACHINE LEARNING, 2024, 113(7): 3999-4012.
  • [10] Nanda, Vedant; Lamba, Hemank; Agarwal, Divyansh; Arora, Megha; Sachdeva, Niharika; Kumaraguru, Ponnurangam. Stop the KillFies! Using Deep Learning Models to Identify Dangerous Selfies. COMPANION PROCEEDINGS OF THE WORLD WIDE WEB CONFERENCE 2018 (WWW 2018), 2018: 1341-1345.