Molecular representation learning based on Transformer with fixed-length padding method

Cited by: 0

Authors:

Wu, Yichu [1]

Yang, Yang [1]

Zhang, Ruimeng [1]

Chen, Zijian [1]

Jin, Meichen [1]

Zou, Yi [1]

Wang, Zhonghua [1,2]

Wu, Fanhong [1,2]

Affiliations:

[1] Shanghai Inst Technol, Sch Chem & Environm Engn, Shanghai 201418, Peoples R China

[2] Shanghai Engn Res Ctr Green Fluoropharmaceut Techn, Shanghai 201418, Peoples R China

Source:

JOURNAL OF MOLECULAR STRUCTURE | 2025 / Vol. 1319

Funding:

National Natural Science Foundation of China;

Keywords:

Molecular representation learning; Transformer; Fixed-length padding; Molecular property prediction; NOMENCLATURE; DATABASE;

DOI:

10.1016/j.molstruc.2024.139574

Chinese Library Classification:

O64 [Physical chemistry (theoretical chemistry), chemical physics];

Discipline classification code:

070304 ; 081704 ;

Abstract:

Effective molecular representation learning plays an important role in molecular modeling for drug design, protein engineering, materials science, and related fields. Self-supervised learning models based on the Transformer architecture have shown great promise in molecular representation. However, batch training of a Transformer model requires input data of consistent length, whereas the molecular data (SMILES sequences) vary in length from entry to entry, so the model cannot process batches directly. A strategy is therefore needed that lets the model process variable-length data in batches. In this work, we adopt head-tail padding and tail padding to obtain fixed-length data, which serve as inputs to the Transformer encoder and decoder, respectively, thus overcoming the Transformer's inability to batch-process inputs of inconsistent length. In this way, our Transformer-based model can be batch-trained on molecular data, improving the efficiency, accuracy, and simplicity of molecular representation learning. Public datasets are then used to evaluate the performance of our molecular representation model in predicting molecular properties. In the classification and regression tasks, the average ROC-AUC and RMSE values improve by over 10.3% and 3.3%, respectively, compared with the baseline models. Furthermore, after the molecular representation vectors are compressed into two- or three-dimensional space using the PCA dimensionality reduction algorithm, they show specific rather than random distributions. Our work highlights the potential of the Transformer model in batch training for constructing molecular representation models, thus providing a new path for AI technology in molecular modeling.
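The abstract does not spell out how the two padding schemes are defined, so the following is only a minimal sketch under stated assumptions: "head-tail padding" is read here as splitting the pad tokens between the start and the end of the sequence, and "tail padding" as appending them all at the end. The `<pad>` token, the character-level tokenizer, the function names, and `MAX_LEN` are hypothetical and are not taken from the paper.

```python
PAD = "<pad>"  # hypothetical padding token, not from the paper


def tokenize(smiles):
    # Naive character-level split (assumption). A real SMILES tokenizer
    # would keep multi-character atoms such as "Cl" or "Br" together.
    return list(smiles)


def tail_pad(tokens, max_len, pad=PAD):
    # Append pad tokens at the end until the sequence has max_len tokens.
    if len(tokens) > max_len:
        raise ValueError("sequence longer than max_len")
    return tokens + [pad] * (max_len - len(tokens))


def head_tail_pad(tokens, max_len, pad=PAD):
    # Assumed reading of "head-tail padding": split the pad tokens
    # between the start and the end, with any odd one going to the end.
    extra = max_len - len(tokens)
    if extra < 0:
        raise ValueError("sequence longer than max_len")
    head = extra // 2
    tail = extra - head
    return [pad] * head + tokens + [pad] * tail


MAX_LEN = 10  # hypothetical fixed length
batch = ["CCO", "c1ccccc1", "CC(=O)O"]

# Every row now has the same length, so the batch can be stacked.
encoder_inputs = [head_tail_pad(tokenize(s), MAX_LEN) for s in batch]
decoder_inputs = [tail_pad(tokenize(s), MAX_LEN) for s in batch]
```

Under these assumptions every sequence in `encoder_inputs` and `decoder_inputs` has exactly `MAX_LEN` tokens. In practice the pad positions would also be masked in the attention computation so that they do not influence the learned representation.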


Pages: 12
