Molecular representation learning based on Transformer with fixed-length padding method

Citations: 0
Authors
Wu, Yichu [1 ]
Yang, Yang [1 ]
Zhang, Ruimeng [1 ]
Chen, Zijian [1 ]
Jin, Meichen [1 ]
Zou, Yi [1 ]
Wang, Zhonghua [1 ,2 ]
Wu, Fanhong [1 ,2 ]
Affiliations
[1] Shanghai Inst Technol, Sch Chem & Environm Engn, Shanghai 201418, Peoples R China
[2] Shanghai Engn Res Ctr Green Fluoropharmaceut Techn, Shanghai 201418, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Molecular representation learning; Transformer; Fixed-length padding; Molecular property prediction; NOMENCLATURE; DATABASE;
DOI
10.1016/j.molstruc.2024.139574
Chinese Library Classification (CLC)
O64 [Physical Chemistry (Theoretical Chemistry), Chemical Physics];
Subject classification codes
070304 ; 081704 ;
Abstract
Effective molecular representation learning plays an important role in the molecular modeling processes of drug design, protein engineering, materials science, and related fields. Currently, self-supervised learning models based on the Transformer architecture have shown great promise in molecular representation. However, batch training of a Transformer model requires input data of consistent length, whereas the length of each entry's molecular data (SMILES sequence) varies, so the model cannot process batches directly. Therefore, corresponding strategies are needed to enable the model to handle data of inconsistent length in batches. In this work, we adopt a strategy of head-tail padding and tail padding to obtain fixed-length data, which are employed as inputs for the Transformer encoder and decoder respectively, thus overcoming the Transformer's inability to batch-process input data of inconsistent length. In this way, our Transformer-based model can be used for batch training of molecular data, thereby improving the efficiency, accuracy, and simplicity of molecular representation learning. Subsequently, public datasets are used to evaluate the performance of our molecular representation model in predicting molecular properties. In the classification and regression tasks, the average ROC-AUC and RMSE values improve by over 10.3% and 3.3% respectively compared to the baseline models. Furthermore, specific distributions, rather than random ones, are found after compressing the molecular representation vectors into two-dimensional or three-dimensional space using the PCA dimensionality reduction algorithm. Our work highlights the potential of the Transformer model in batch training for constructing molecular representation models, thus providing a new path for AI technology in molecular modeling.
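The padding scheme described in the abstract can be illustrated with a minimal Python sketch. The character-level tokenizer, the special tokens, and the fixed length MAX_LEN below are illustrative assumptions rather than the authors' exact implementation, and the centered split is only one plausible reading of "head-tail padding"; the point is simply that every SMILES entry is brought to the same length so a dataset can be stacked into one batch tensor.

# Minimal sketch of fixed-length padding for SMILES sequences (assumptions noted above).
PAD, BOS, EOS = "[PAD]", "[BOS]", "[EOS]"
MAX_LEN = 64  # hypothetical fixed sequence length

def head_tail_pad(tokens, max_len=MAX_LEN):
    # Pad at both the head and the tail so every encoder input has max_len tokens.
    total = max_len - len(tokens)
    left = total // 2
    return [PAD] * left + tokens + [PAD] * (total - left)

def tail_pad(tokens, max_len=MAX_LEN):
    # Pad only at the tail so every decoder input has max_len tokens.
    return tokens + [PAD] * (max_len - len(tokens))

smiles = "CCO"                          # ethanol, a short SMILES example
tokens = [BOS] + list(smiles) + [EOS]   # character-level tokenization
encoder_input = head_tail_pad(tokens)   # fixed-length encoder input
decoder_input = tail_pad(tokens)        # fixed-length decoder input
assert len(encoder_input) == len(decoder_input) == MAX_LEN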
Pages: 12
Related papers
50 records in total
  • [41] A Topological Charge Continuously Tunable Orbital Angular Momentum (OAM) Electromagnetic Wave Generation Method Based on Fixed-length Delay Line Mixing Circuit
    Zhou, Yuliang
    Li, Xiaona
    Yao, Kaiyuan
    Huang, Yong Mao
    Jin, Haiyan
    APPLIED COMPUTATIONAL ELECTROMAGNETICS SOCIETY JOURNAL, 2022, 37 (10): : 1071 - 1076
  • [42] A Fixed-Length Transfer Delay Based Adaptive Frequency-Locked Loop for Single-Phase Systems
    Dai, Zhiyong
    Zhang, Zhen
    Yang, Yongheng
    Blaabjerg, Frede
    Huangfu, Yigeng
    Zhang, Juxiang
    IEEE TRANSACTIONS ON POWER ELECTRONICS, 2019, 34 (05) : 4000 - 4004
  • [43] A Generic Framework for Scan Capture Power Reduction in Fixed-Length Symbol-based Test Compression Environment
    Liu, Xiao
    Xu, Qiang
    DATE: 2009 DESIGN, AUTOMATION & TEST IN EUROPE CONFERENCE & EXHIBITION, VOLS 1-3, 2009, : 1494 - 1499
  • [44] A dynamic graph representation learning based on temporal graph transformer
    Zhong, Ying
    Huang, Chenze
    ALEXANDRIA ENGINEERING JOURNAL, 2023, 63 : 359 - 369
  • [45] Transformer-Based Representation Learning on Temporal Heterogeneous Graphs
    Li, Longhai
    Duan, Lei
    Wang, Junchen
    Xie, Guicai
    He, Chengxin
    Chen, Zihao
    Deng, Song
    WEB AND BIG DATA, PT II, APWEB-WAIM 2022, 2023, 13422 : 385 - 400
  • [47] Arabic Speech Classification Method Based on Padding and Deep Learning Neural Network
    Asroni, Asroni
    Ku-Mahamud, Ku Ruhana
    Damarjati, Cahya
    Slamat, Hasan Basri
    BAGHDAD SCIENCE JOURNAL, 2021, 18 (02) : 925 - 936
  • [49] AdaMGT: Molecular representation learning via adaptive mixture of GCN-Transformer
    Ding, Cangfeng
    Yan, Zhaoyao
    Ma, Lerong
    Cao, Bohao
    Cao, Lu
    KNOWLEDGE-BASED SYSTEMS, 2025, 311
  • [50] Molecular representation contrastive learning via transformer embedding to graph neural networks
    Liu, Yunwu
    Zhang, Ruisheng
    Li, Tongfeng
    Jiang, Jing
    Ma, Jun
    Yuan, Yongna
    Wang, Ping
    APPLIED SOFT COMPUTING, 2024, 164