Protein embedding based alignment

Cited by: 4
Authors
Iovino, Benjamin Giovanni [1]
Ye, Yuzhen [1]
Affiliations
[1] Indiana Univ, Luddy Sch Informat Comp & Engn, 700 N Woodlawn Ave, Bloomington, IN 47408 USA
Keywords
Protein embedding; Protein sequence alignment; Smith-Waterman algorithm; Twilight zone; SEQUENCE; LANGUAGE; SEARCH;
DOI
10.1186/s12859-024-05699-5
Chinese Library Classification number
Q5 [Biochemistry];
Discipline classification code
071010; 081704;
Abstract
Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have relied on substitution matrices since their introduction in the 1970s; however, these matrices do not score alignments well within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the match score for a pair of amino acids is based on the similarity of their embeddings from a protein language model.
Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five BAliBASE reference sets were used, each with different sequence identities, motifs, and lengths, which shows how well PEbA aligns under different conditions.
Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignments. The size of the improvement in alignment quality depended on the similarity of the sequence pair, and PEbA performed over four times as well for pairs of sequences with <10% identity. We also compared PEbA with embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed alignment methods based on protein language model embeddings.
Conclusion: Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than commonly used methods.
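The idea described in the abstract, Smith-Waterman-style dynamic programming with the substitution-matrix lookup replaced by an embedding similarity, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes cosine similarity as the similarity measure, a linear gap penalty, and hypothetical `scale`, `shift`, and `gap` values; the one-hot `toy_embed` helper is only a stand-in for per-residue embeddings from a protein language model.

```python
import numpy as np

def cosine_score_matrix(emb_a, emb_b, scale=10.0, shift=4.0):
    """Score every residue pair as scale * cosine(embedding_a, embedding_b) - shift.

    scale and shift are illustrative assumptions, not values from the paper.
    """
    a = emb_a / np.linalg.norm(emb_a, axis=1, keepdims=True)
    b = emb_b / np.linalg.norm(emb_b, axis=1, keepdims=True)
    return scale * (a @ b.T) - shift

def local_align(seq_a, seq_b, scores, gap=-5.0):
    """Smith-Waterman local alignment with a linear gap penalty (an assumption)."""
    n, m = len(seq_a), len(seq_b)
    H = np.zeros((n + 1, m + 1))
    # Traceback codes: 0 = stop, 1 = diagonal, 2 = up (gap in seq_b), 3 = left (gap in seq_a)
    trace = np.zeros((n + 1, m + 1), dtype=np.int8)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            candidates = (
                0.0,
                H[i - 1, j - 1] + scores[i - 1, j - 1],
                H[i - 1, j] + gap,
                H[i, j - 1] + gap,
            )
            k = int(np.argmax(candidates))
            H[i, j] = candidates[k]
            trace[i, j] = k
    # Start the traceback at the highest-scoring cell and stop at the first 0 code.
    i, j = np.unravel_index(np.argmax(H), H.shape)
    best = float(H[i, j])
    out_a, out_b = [], []
    while trace[i, j] != 0:
        if trace[i, j] == 1:
            out_a.append(seq_a[i - 1]); out_b.append(seq_b[j - 1])
            i -= 1; j -= 1
        elif trace[i, j] == 2:
            out_a.append(seq_a[i - 1]); out_b.append("-")
            i -= 1
        else:
            out_a.append("-"); out_b.append(seq_b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

def toy_embed(seq, alphabet="ACDEFGHIKLMNPQRSTVWY"):
    """Hypothetical stand-in for a language-model embedding: one-hot vector per residue."""
    eye = np.eye(len(alphabet))
    return np.stack([eye[alphabet.index(res)] for res in seq])

seq_a, seq_b = "MKTAYIAK", "GGTAYIAGG"
scores = cosine_score_matrix(toy_embed(seq_a), toy_embed(seq_b))
best, aligned_a, aligned_b = local_align(seq_a, seq_b, scores)
print(best, aligned_a, aligned_b)
```

With one-hot vectors the score reduces to a fixed match/mismatch scheme. Context-dependent embeddings would instead give the same residue pair different scores at different positions, which is the property the abstract credits for better twilight-zone alignments.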
Pages: 16