Molecular representation learning based on Transformer with fixed-length padding method

Citations: 0
Authors
Wu, Yichu [1 ]
Yang, Yang [1 ]
Zhang, Ruimeng [1 ]
Chen, Zijian [1 ]
Jin, Meichen [1 ]
Zou, Yi [1 ]
Wang, Zhonghua [1 ,2 ]
Wu, Fanhong [1 ,2 ]
Affiliations
[1] Shanghai Inst Technol, Sch Chem & Environm Engn, Shanghai 201418, Peoples R China
[2] Shanghai Engn Res Ctr Green Fluoropharmaceut Techn, Shanghai 201418, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Molecular representation learning; Transformer; Fixed-length padding; Molecular property prediction; NOMENCLATURE; DATABASE;
DOI
10.1016/j.molstruc.2024.139574
Chinese Library Classification (CLC)
O64 [Physical Chemistry (Theoretical Chemistry), Chemical Physics];
Subject classification codes
070304 ; 081704 ;
Abstract
Effective molecular representation learning plays an important role in the molecular modeling processes of drug design, protein engineering, materials science, and related fields. Currently, self-supervised learning models based on the Transformer architecture have shown great promise in molecular representation. However, batch training of a Transformer model requires input data of consistent length, whereas the length of each entry's molecular data (SMILES sequence) varies, so the model cannot process batches directly. Therefore, corresponding strategies are needed to enable the model to handle data of inconsistent length in batches. In this work, we adopt a strategy of head-tail padding and tail padding to obtain fixed-length data, which are employed as inputs for the Transformer encoder and decoder respectively, thus overcoming the Transformer's inability to batch-process input data of inconsistent length. In this way, our Transformer-based model can be used for batch training of molecular data, thereby improving the efficiency, accuracy, and simplicity of molecular representation learning. Subsequently, public datasets are used to evaluate the performance of our molecular representation model in predicting molecular properties. In the classification and regression tasks, the average ROC-AUC and RMSE values improve by over 10.3% and 3.3% respectively compared to the baseline models. Furthermore, specific distributions, rather than random ones, are found after compressing the molecular representation vectors into two-dimensional or three-dimensional space using the PCA dimensionality reduction algorithm. Our work highlights the potential of the Transformer model in batch training for constructing molecular representation models, thus providing a new path for AI technology in molecular modeling.
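The padding scheme described in the abstract can be illustrated with a minimal Python sketch. The character-level tokenizer, the special tokens, and the fixed length MAX_LEN below are illustrative assumptions rather than the authors' exact implementation, and the centered split is only one plausible reading of "head-tail padding"; the point is simply that every SMILES entry is brought to the same length so a dataset can be stacked into one batch tensor.

# Minimal sketch of fixed-length padding for SMILES sequences (assumptions noted above).
PAD, BOS, EOS = "[PAD]", "[BOS]", "[EOS]"
MAX_LEN = 64  # hypothetical fixed sequence length

def head_tail_pad(tokens, max_len=MAX_LEN):
    # Pad at both the head and the tail so every encoder input has max_len tokens.
    total = max_len - len(tokens)
    left = total // 2
    return [PAD] * left + tokens + [PAD] * (total - left)

def tail_pad(tokens, max_len=MAX_LEN):
    # Pad only at the tail so every decoder input has max_len tokens.
    return tokens + [PAD] * (max_len - len(tokens))

smiles = "CCO"                          # ethanol, a short SMILES example
tokens = [BOS] + list(smiles) + [EOS]   # character-level tokenization
encoder_input = head_tail_pad(tokens)   # fixed-length encoder input
decoder_input = tail_pad(tokens)        # fixed-length decoder input
assert len(encoder_input) == len(decoder_input) == MAX_LEN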
Pages: 12
Related papers
50 records in total
  • [41] A Topological Charge Continuously Tunable Orbital Angular Momentum (OAM) Electromagnetic Wave Generation Method Based on Fixed-length Delay Line Mixing Circuit
    Zhou, Yuliang
    Li, Xiaona
    Yao, Kaiyuan
    Huang, Yong Mao
    Jin, Haiyan
    APPLIED COMPUTATIONAL ELECTROMAGNETICS SOCIETY JOURNAL, 2022, 37 (10): : 1071 - 1076
  • [42] A Fixed-Length Transfer Delay Based Adaptive Frequency-Locked Loop for Single-Phase Systems
    Dai, Zhiyong
    Zhang, Zhen
    Yang, Yongheng
    Blaabjerg, Frede
    Huangfu, Yigeng
    Zhang, Juxiang
    IEEE TRANSACTIONS ON POWER ELECTRONICS, 2019, 34 (05) : 4000 - 4004
  • [43] A Generic Framework for Scan Capture Power Reduction in Fixed-Length Symbol-based Test Compression Environment
    Liu, Xiao
    Xu, Qiang
    DATE: 2009 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, VOLS 1-3, 2009, : 1494 - 1499
  • [44] A dynamic graph representation learning based on temporal graph transformer
    Zhong, Ying
    Huang, Chenze
    ALEXANDRIA ENGINEERING JOURNAL, 2023, 63 : 359 - 369
  • [45] Transformer-Based Representation Learning on Temporal Heterogeneous Graphs
    Li, Longhai
    Duan, Lei
    Wang, Junchen
    Xie, Guicai
    He, Chengxin
    Chen, Zihao
    Deng, Song
    WEB AND BIG DATA, PT II, APWEB-WAIM 2022, 2023, 13422 : 385 - 400
  • [47] Arabic Speech Classification Method Based on Padding and Deep Learning Neural Network
    Asroni, Asroni
    Ku-Mahamud, Ku Ruhana
    Damarjati, Cahya
    Slamat, Hasan Basri
    BAGHDAD SCIENCE JOURNAL, 2021, 18 (02) : 925 - 936
  • [49] AdaMGT: Molecular representation learning via adaptive mixture of GCN-Transformer
    Ding, Cangfeng
    Yan, Zhaoyao
    Ma, Lerong
    Cao, Bohao
    Cao, Lu
    KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [50] Molecular representation contrastive learning via transformer embedding to graph neural networks
    Liu, Yunwu
    Zhang, Ruisheng
    Li, Tongfeng
    Jiang, Jing
    Ma, Jun
    Yuan, Yongna
    Wang, Ping
    APPLIED SOFT COMPUTING, 2024, 164