Molecular representation learning based on Transformer with fixed-length padding method

Cited: 0
Authors
Wu, Yichu [1 ]
Yang, Yang [1 ]
Zhang, Ruimeng [1 ]
Chen, Zijian [1 ]
Jin, Meichen [1 ]
Zou, Yi [1 ]
Wang, Zhonghua [1 ,2 ]
Wu, Fanhong [1 ,2 ]
Affiliations
[1] Shanghai Inst Technol, Sch Chem & Environm Engn, Shanghai 201418, Peoples R China
[2] Shanghai Engn Res Ctr Green Fluoropharmaceut Techn, Shanghai 201418, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Molecular representation learning; Transformer; Fixed-length padding; Molecular property prediction; NOMENCLATURE; DATABASE;
DOI
10.1016/j.molstruc.2024.139574
Chinese Library Classification
O64 [Physical chemistry (theoretical chemistry), chemical physics]
Discipline Classification Codes
070304; 081704
Abstract
Effective molecular representation learning plays an important role in the molecular modeling stages of drug design, protein engineering, materials science, and related fields. Self-supervised learning models based on the Transformer architecture have recently shown great promise for molecular representation. However, batch training of a Transformer requires inputs of uniform length, whereas molecular data (SMILES sequences) vary in length from entry to entry, so the model cannot process batches directly. A strategy is therefore needed that lets the model handle variable-length data in batches. In this work, we adopt head-tail padding and tail padding to obtain fixed-length sequences, which serve as inputs to the Transformer encoder and decoder respectively, thus overcoming the Transformer's inability to batch-process inputs of inconsistent length. Our Transformer-based model can then be trained on molecular data in batches, improving the efficiency, accuracy, and simplicity of molecular representation learning. We evaluate the resulting molecular representations on public molecular property prediction datasets: on classification and regression tasks, the average ROC-AUC and RMSE improve by over 10.3% and 3.3%, respectively, compared with the baseline models. Furthermore, after compressing the molecular representation vectors into two- or three-dimensional space with the PCA dimensionality reduction algorithm, the representations exhibit specific distributions rather than random ones. Our work highlights the potential of batch-trained Transformer models for constructing molecular representation models, providing a new path for AI techniques in molecular modeling.
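The padding idea described above can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration of the two schemes the abstract names, not the paper's actual implementation: the special tokens (<pad>, <bos>, <eos>), the character-level tokenization, and the even head/tail split are assumptions, since the abstract does not specify these details.

# Sketch of the two padding schemes named in the abstract.
# Token names and exact placement rules are assumptions, not the paper's spec.

PAD, BOS, EOS = "<pad>", "<bos>", "<eos>"

def head_tail_pad(tokens, max_len):
    """Pad a tokenized SMILES on both ends to max_len (encoder input)."""
    n_pad = max_len - len(tokens)
    if n_pad < 0:
        raise ValueError("sequence longer than max_len")
    head = n_pad // 2          # assumed even split between head and tail
    return [PAD] * head + tokens + [PAD] * (n_pad - head)

def tail_pad(tokens, max_len):
    """Wrap in BOS/EOS, then pad only at the tail to max_len (decoder input)."""
    seq = [BOS] + tokens + [EOS]
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    return seq + [PAD] * (max_len - len(seq))

smiles_tokens = list("CCO")    # naive character-level tokenization of ethanol
print(head_tail_pad(smiles_tokens, 8))
# ['<pad>', '<pad>', 'C', 'C', 'O', '<pad>', '<pad>', '<pad>']
print(tail_pad(smiles_tokens, 8))
# ['<bos>', 'C', 'C', 'O', '<eos>', '<pad>', '<pad>', '<pad>']

Under these assumptions, every entry in a batch reaches the same fixed length, which is what allows the SMILES sequences to be stacked into a single tensor for batched Transformer training.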
Pages: 12