Protein embedding based alignment

Cited by: 4
Authors
Iovino, Benjamin Giovanni [1]
Ye, Yuzhen [1]
Affiliations
[1] Indiana Univ, Luddy Sch Informat Comp & Engn, 700 N Woodlawn Ave, Bloomington, IN 47408 USA
Keywords
Protein embedding; Protein sequence alignment; Smith-Waterman algorithm; Twilight zone; SEQUENCE; LANGUAGE; SEARCH;
DOI
10.1186/s12859-024-05699-5
Chinese Library Classification
Q5 [Biochemistry];
Discipline classification code
071010 ; 081704 ;
Abstract
Purpose: Despite considerable progress in alignment algorithms, aligning divergent protein sequences with less than 20-35% pairwise identity (the so-called "twilight zone") remains a difficult problem. Many alignment algorithms have relied on substitution matrices since these were introduced in the 1970s; however, such matrices do not score alignments well within the twilight zone. We developed Protein Embedding based Alignments, or PEbA, to better align sequences with low pairwise identity. Like the traditional Smith-Waterman algorithm, PEbA uses dynamic programming, but the match score for a pair of amino acids is based on the similarity of their embeddings from a protein language model.
Methods: We tested PEbA on over twelve thousand benchmark pairwise alignments from BAliBASE, each extracted from one of its multiple sequence alignments. Five BAliBASE reference sets were used, each with different sequence identities, motifs, and lengths, showing how well PEbA aligns under different conditions.
Results: PEbA greatly outperformed BLOSUM substitution matrix-based pairwise alignment, with the size of the improvement in alignment quality depending on the similarity of the sequence pair (performing over four times as well for pairs of sequences with <10% identity). We also compared PEbA using embeddings generated by different protein language models (ProtT5 and ESM-2) and found that ProtT5-XL-U50 produced the most useful embeddings for aligning protein sequences. PEbA also outperformed DEDAL and vcMSA, two recently developed protein language model embedding-based alignment methods.
Conclusion: Our results suggest that general-purpose protein language models provide useful contextual information for generating more accurate protein alignments than commonly used methods.
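The core idea in the abstract, a Smith-Waterman-style dynamic program whose match score comes from embedding similarity rather than a substitution matrix, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the linear gap penalty, and the `scale` and `offset` values that turn cosine similarity into a signed score are all assumptions chosen for illustration, and the toy 2-dimensional vectors stand in for real per-residue embeddings from a protein language model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def embedding_local_align(emb_a, emb_b, gap=1.0, scale=2.0, offset=0.5):
    """Local alignment over per-residue embeddings (hypothetical helper).

    The match score is scale * (cosine - offset), so dissimilar residues
    score negatively. gap, scale, and offset are illustrative assumptions,
    not values taken from the paper.
    """
    n, m = len(emb_a), len(emb_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0.0, (0, 0)

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = scale * (cosine(emb_a[i - 1], emb_b[j - 1]) - offset)
            candidates = [
                (0.0, None),                              # start a new local alignment
                (score[i - 1][j - 1] + match, "diag"),    # align residue i with residue j
                (score[i - 1][j] - gap, "up"),            # gap in sequence b
                (score[i][j - 1] - gap, "left"),          # gap in sequence a
            ]
            score[i][j], move[i][j] = max(candidates, key=lambda c: c[0])
            if score[i][j] > best:
                best, best_pos = score[i][j], (i, j)

    # Trace back from the best-scoring cell until a cell with no predecessor.
    pairs = []
    i, j = best_pos
    while move[i][j] is not None:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


# Toy stand-ins for per-residue embeddings (real ones have many more dimensions).
emb_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
emb_b = [[0.0, 1.0], [1.0, 1.0]]
best, pairs = embedding_local_align(emb_a, emb_b)
print(pairs)  # aligned (index_in_a, index_in_b) pairs: [(1, 0), (2, 1)]
```

In this sketch the only change from a matrix-based Smith-Waterman is the `match` term; swapping it for a BLOSUM lookup would recover the traditional scoring. A real pipeline would also need affine gap penalties and embeddings produced by a model such as ProtT5 or ESM-2, neither of which is shown here.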
Pages: 16
Related papers
50 in total
  • [31] User Alignment Across Social Networks Based On ego-Network Embedding
    Zhen, Yu
    Hu, Ruimin
    Li, Dengshi
    Xiao, Yilin
    2022 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2022,
  • [32] PRASEMap: A Probabilistic Reasoning and Semantic Embedding based Knowledge Graph Alignment System
    Qi, Zhiyuan
    Zhang, Ziheng
    Chen, Jiaoyan
    Chen, Xi
    Zheng, Yefeng
    PROCEEDINGS OF THE 30TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, CIKM 2021, 2021, : 4779 - 4783
  • [33] ICARUS: flexible protein structural alignment based on Protein Units
    Cretin, Gabriel
    Perin, Charlotte
    Zimmermann, Nicolas
    Galochkina, Tatiana
    Gelly, Jean-Christophe
    BIOINFORMATICS, 2023, 39 (08)
  • [34] Protein Structure Alignment Based on Internal Coordinates
    Shen, Yue-Feng
    Li, Bo
    Liu, Zhi-Ping
    INTERDISCIPLINARY SCIENCES-COMPUTATIONAL LIFE SCIENCES, 2010, 2 (04) : 308 - 319
  • [35] Protein structure alignment based on internal coordinates
    Yue-Feng Shen
    Bo Li
    Zhi-Ping Liu
    Interdisciplinary Sciences: Computational Life Sciences, 2010, 2 : 308 - 319
  • [36] Two-Stage Entity Alignment: Combining Hybrid Knowledge Graph Embedding with Similarity-Based Relation Alignment
    Jiang, Tingting
    Bu, Chenyang
    Zhu, Yi
    Wu, Xindong
    PRICAI 2019: TRENDS IN ARTIFICIAL INTELLIGENCE, PT I, 2019, 11670 : 162 - 175
  • [37] Graph matching using spectral embedding and alignment
    Bai, X
    Yu, H
    Hancock, ER
    PROCEEDINGS OF THE 17TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION, VOL 3, 2004, : 398 - 401
  • [38] Optimizing the Accuracy of Randomized Embedding for Sequence Alignment
    Yan, Yiqing
    Chaturvedi, Nimisha
    Appuswamy, Raja
    Proceedings - 2022 IEEE 36th International Parallel and Distributed Processing Symposium Workshops, IPDPSW 2022, 2022, : 144 - 151
  • [39] Sentence Embedding Alignment for Lifelong Relation Extraction
    Wang, Hong
    Xiong, Wenhan
    Yu, Mo
    Guo, Xiaoxiao
    Chang, Shiyu
    Wang, William Yang
    2019 CONFERENCE OF THE NORTH AMERICAN CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS: HUMAN LANGUAGE TECHNOLOGIES (NAACL HLT 2019), VOL. 1, 2019, : 796 - 806
  • [40] Bootstrapping Entity Alignment with Knowledge Graph Embedding
    Sun, Zequn
    Hu, Wei
    Zhang, Qingheng
    Qu, Yuzhong
    PROCEEDINGS OF THE TWENTY-SEVENTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2018, : 4396 - 4402