Molecular representation learning based on Transformer with fixed-length padding method

Cited: 0
Authors
Wu, Yichu [1 ]
Yang, Yang [1 ]
Zhang, Ruimeng [1 ]
Chen, Zijian [1 ]
Jin, Meichen [1 ]
Zou, Yi [1 ]
Wang, Zhonghua [1 ,2 ]
Wu, Fanhong [1 ,2 ]
Affiliations
[1] Shanghai Inst Technol, Sch Chem & Environm Engn, Shanghai 201418, Peoples R China
[2] Shanghai Engn Res Ctr Green Fluoropharmaceut Techn, Shanghai 201418, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Molecular representation learning; Transformer; Fixed-length padding; Molecular property prediction; NOMENCLATURE; DATABASE;
DOI
10.1016/j.molstruc.2024.139574
Chinese Library Classification
O64 [Physical chemistry (theoretical chemistry), chemical physics]
Discipline Classification Codes
070304; 081704
Abstract
Effective molecular representation learning plays an important role in the molecular modeling stages of drug design, protein engineering, materials science, and related fields. Self-supervised learning models based on the Transformer architecture have recently shown great promise for molecular representation. However, batch training of a Transformer requires inputs of uniform length, whereas molecular data (SMILES sequences) vary in length from entry to entry, so the model cannot process batches directly. A strategy is therefore needed that lets the model handle variable-length data in batches. In this work, we adopt head-tail padding and tail padding to obtain fixed-length sequences, which serve as inputs to the Transformer encoder and decoder respectively, thus overcoming the Transformer's inability to batch-process inputs of inconsistent length. Our Transformer-based model can then be trained on molecular data in batches, improving the efficiency, accuracy, and simplicity of molecular representation learning. We evaluate the resulting molecular representations on public molecular property prediction datasets: on classification and regression tasks, the average ROC-AUC and RMSE improve by over 10.3% and 3.3%, respectively, compared with the baseline models. Furthermore, after compressing the molecular representation vectors into two- or three-dimensional space with the PCA dimensionality reduction algorithm, the representations exhibit specific distributions rather than random ones. Our work highlights the potential of batch-trained Transformer models for constructing molecular representation models, providing a new path for AI techniques in molecular modeling.
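The padding idea described above can be illustrated with a short sketch. The Python snippet below is a minimal, hypothetical illustration of the two schemes the abstract names, not the paper's actual implementation: the special tokens (<pad>, <bos>, <eos>), the character-level tokenization, and the even head/tail split are assumptions, since the abstract does not specify these details.

# Sketch of the two padding schemes named in the abstract.
# Token names and exact placement rules are assumptions, not the paper's spec.

PAD, BOS, EOS = "<pad>", "<bos>", "<eos>"

def head_tail_pad(tokens, max_len):
    """Pad a tokenized SMILES on both ends to max_len (encoder input)."""
    n_pad = max_len - len(tokens)
    if n_pad < 0:
        raise ValueError("sequence longer than max_len")
    head = n_pad // 2          # assumed even split between head and tail
    return [PAD] * head + tokens + [PAD] * (n_pad - head)

def tail_pad(tokens, max_len):
    """Wrap in BOS/EOS, then pad only at the tail to max_len (decoder input)."""
    seq = [BOS] + tokens + [EOS]
    if len(seq) > max_len:
        raise ValueError("sequence longer than max_len")
    return seq + [PAD] * (max_len - len(seq))

smiles_tokens = list("CCO")    # naive character-level tokenization of ethanol
print(head_tail_pad(smiles_tokens, 8))
# ['<pad>', '<pad>', 'C', 'C', 'O', '<pad>', '<pad>', '<pad>']
print(tail_pad(smiles_tokens, 8))
# ['<bos>', 'C', 'C', 'O', '<eos>', '<pad>', '<pad>', '<pad>']

Under these assumptions, every entry in a batch reaches the same fixed length, which is what allows the SMILES sequences to be stacked into a single tensor for batched Transformer training.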
Pages: 12