Protein embedding based alignment

Cited by: 4
Authors
Iovino, Benjamin Giovanni [1]
Ye, Yuzhen [1]
Affiliation
[1] Indiana Univ, Luddy Sch Informat Comp & Engn, 700 N Woodlawn Ave, Bloomington, IN 47408 USA
Keywords
Protein embedding; Protein sequence alignment; Smith-Waterman algorithm; Twilight zone; SEQUENCE; LANGUAGE; SEARCH;
DOI
10.1186/s12859-024-05699-5
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have relied on substitution matrices since their introduction in the 1970s; however, these matrices do not score alignments well within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the match score for a pair of amino acids is based on the similarity of their embeddings from a protein language model.
Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five BAliBASE reference sets were used, each with different sequence identities, motifs, and lengths, to show how well PEbA aligns under different conditions.
Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignment, with the degree of improvement in alignment quality varying with sequence similarity (it performed over four times as well for pairs of sequences with <10% identity). We also compared PEbA using embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed alignment methods based on protein language model embeddings.
Conclusion: Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than commonly used methods.
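The idea described in the abstract, Smith-Waterman-style dynamic programming in which the match score comes from embedding similarity instead of a substitution matrix, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names (`cosine`, `local_align`), the use of cosine similarity, the `scale` factor, and the linear `gap` penalty are all assumptions chosen for illustration; the abstract does not specify the similarity measure or the gap scheme. The per-residue embeddings are taken as plain lists of floats, so no protein language model is needed to run it.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def local_align(
    emb_a: Sequence[Vector],
    emb_b: Sequence[Vector],
    gap: float = 1.0,    # assumed linear gap penalty (illustrative value)
    scale: float = 2.0,  # assumed scaling of the similarity (illustrative value)
) -> Tuple[float, List[Tuple[int, int]]]:
    """Local alignment over per-residue embeddings.

    Returns the best local score and the aligned residue pairs as
    0-based (index_in_a, index_in_b) tuples.
    """
    n, m = len(emb_a), len(emb_b)
    # Score matrix with an extra zero row and column, as in Smith-Waterman.
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Match score from embedding similarity, not a substitution matrix.
            match = H[i - 1][j - 1] + scale * cosine(emb_a[i - 1], emb_b[j - 1])
            H[i][j] = max(0.0, match, H[i - 1][j] - gap, H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the best cell until a zero cell or a border is reached.
    pairs: List[Tuple[int, int]] = []
    i, j = best_pos
    while i > 0 and j > 0 and H[i][j] > 0.0:
        match = H[i - 1][j - 1] + scale * cosine(emb_a[i - 1], emb_b[j - 1])
        if H[i][j] == match:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1  # gap in sequence b
        else:
            j -= 1  # gap in sequence a
    pairs.reverse()
    return best, pairs
```

To use the sketch on real proteins, `emb_a` and `emb_b` would be replaced by per-residue embeddings from a protein language model, one vector per amino acid. Because cosine similarity is bounded, a practical scheme would likely need the scale and gap values tuned against a benchmark; the defaults here are placeholders.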
Pages: 16