Molecular representation learning based on Transformer with fixed-length padding method

Cited by: 0
Authors
Wu, Yichu [1]
Yang, Yang [1]
Zhang, Ruimeng [1]
Chen, Zijian [1]
Jin, Meichen [1]
Zou, Yi [1]
Wang, Zhonghua [1, 2]
Wu, Fanhong [1, 2]
Affiliations
[1] Shanghai Inst Technol, Sch Chem & Environm Engn, Shanghai 201418, Peoples R China
[2] Shanghai Engn Res Ctr Green Fluoropharmaceut Techn, Shanghai 201418, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Molecular representation learning; Transformer; Fixed-length padding; Molecular property prediction; NOMENCLATURE; DATABASE;
DOI
10.1016/j.molstruc.2024.139574
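Chinese Library Classification (CLC) number
O64 [Physical chemistry (theoretical chemistry), chemical physics];
Subject classification code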
070304 ; 081704 ;
Abstract
Effective molecular representation learning plays an important role in molecular modeling for drug design, protein engineering, materials science, and related fields. Self-supervised learning models based on the Transformer architecture have shown great promise for molecular representation. However, batch training of a Transformer model requires inputs of consistent length, whereas the molecular data for each entry (a SMILES sequence) vary in length, so the model cannot process batches directly. A strategy is therefore needed that lets the model process variable-length data in batches. In this work, we adopt head-tail padding and tail padding to obtain fixed-length data, which serve as inputs to the Transformer encoder and decoder, respectively, thus overcoming the Transformer's inability to batch process inputs of inconsistent length. In this way, our Transformer-based model can be trained in batches on molecular data, improving the efficiency, accuracy, and simplicity of molecular representation learning. Public datasets are then used to evaluate the performance of our molecular representation model in predicting molecular properties. In the classification and regression tasks, the average ROC-AUC and RMSE values improve by over 10.3% and 3.3%, respectively, compared with the baseline models. Furthermore, after the molecular representation vectors are compressed into two- or three-dimensional space using the PCA dimensionality reduction algorithm, they show specific rather than random distributions. Our work highlights the potential of batch-trained Transformer models for constructing molecular representation models, thus providing a new path for AI technology in molecular modeling.
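The abstract names two padding schemes but does not define them, so the following is only an illustrative sketch and not the authors' implementation. It assumes that "tail padding" appends padding tokens after the SMILES tokens and that "head-tail padding" splits the padding between the front and the back of the sequence. The `<pad>` token, the character-level tokenizer, the function names, and the target length of 10 are all hypothetical choices made for the example.

```python
from typing import List

PAD = "<pad>"  # hypothetical padding token, not taken from the paper


def tokenize(smiles: str) -> List[str]:
    # Simplification: one token per character. Real SMILES tokenizers
    # usually treat multi-character atoms such as "Cl" or "Br" as one token.
    return list(smiles)


def tail_pad(tokens: List[str], length: int) -> List[str]:
    # Append padding after the sequence until it reaches `length`.
    if len(tokens) > length:
        raise ValueError("sequence is longer than the fixed length")
    return tokens + [PAD] * (length - len(tokens))


def head_tail_pad(tokens: List[str], length: int) -> List[str]:
    # Assumed scheme: split the padding between head and tail,
    # giving any odd leftover token to the tail.
    missing = length - len(tokens)
    if missing < 0:
        raise ValueError("sequence is longer than the fixed length")
    head = missing // 2
    tail = missing - head
    return [PAD] * head + tokens + [PAD] * tail


# A small batch of SMILES strings of different lengths.
batch = ["CCO", "c1ccccc1", "CC(=O)O"]
fixed_length = 10  # hypothetical fixed length

encoder_inputs = [head_tail_pad(tokenize(s), fixed_length) for s in batch]
decoder_inputs = [tail_pad(tokenize(s), fixed_length) for s in batch]
```

After padding, every sequence in `encoder_inputs` and `decoder_inputs` has the same length, so each list could be stacked into a single rectangular batch once the tokens are mapped to integer ids.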
Pages: 12