Protein Domain Embeddings for Fast and Accurate Similarity Search

Authors
Iovino, Benjamin Giovanni [1]
Tang, Haixu [1]
Ye, Yuzhen [1]
Affiliations
[1] Indiana Univ, Luddy Sch Informat Comp & Engn, 700 N Woodlawn Ave, Bloomington, IN 47408 USA
Keywords
protein language model (PLM); ESM-2; Domain segmentation; Recursive Cut (RecCut); Discrete Cosine Transformation (DCT); DCT fingerprint; Homology detection; LANGUAGE;
DOI
10.1007/978-1-0716-3989-4_44
Chinese Library Classification (CLC) number
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Recently developed protein language models have enabled a variety of applications of contextual protein embeddings. Per-protein representations, in which each protein is represented as a vector of fixed dimension, can be derived by averaging the embeddings of individual residues or by applying matrix transformations such as the discrete cosine transformation (DCT) to matrices of residue embeddings. Such protein-level embeddings have been used for fast searches of similar proteins; however, limitations have been found. For example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not multi-domain proteins. Here we propose a novel approach that first segments proteins into domains and then applies the DCT to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses contact maps predicted by ESM-2 for domain segmentation; the segmentation problem is solved with a recursive cut algorithm (RecCut for short) in time quadratic in the protein length. We showed that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but have undefined extended regions between shared domains, and between proteins that share only local similarities.
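
The two ways of turning residue embeddings into a fixed-length vector that the abstract mentions (averaging, and a DCT over the residue-by-dimension matrix) can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the choice of an orthonormal DCT-II, the number of low-frequency coefficients kept (`n_rows`, `n_cols`), and the toy embeddings and domain boundaries are all assumptions made for illustration. Real ESM-2 embeddings and RecCut-derived domain boundaries would replace the toy inputs.

```python
import math

def dct2(x):
    """Orthonormal 1-D DCT-II of a sequence of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def mean_pool(emb):
    """Average an L x D matrix of residue embeddings into one D-dim vector."""
    length = len(emb)
    return [sum(row[j] for row in emb) / length for j in range(len(emb[0]))]

def dct_fingerprint(emb, n_rows=3, n_cols=4):
    """Separable 2-D DCT of an L x D matrix; keep the low-frequency
    n_rows x n_cols block and flatten it (sizes are illustrative)."""
    length, dim = len(emb), len(emb[0])
    if length < n_rows or dim < n_cols:
        raise ValueError("matrix is smaller than the requested coefficient block")
    # DCT along the residue axis, one embedding dimension at a time.
    along_residues = [dct2([row[j] for row in emb]) for j in range(dim)]
    # Keep only the first n_rows residue-axis frequencies.
    kept = [[along_residues[j][k] for j in range(dim)] for k in range(n_rows)]
    # DCT along the embedding axis, then keep the first n_cols frequencies.
    block = [dct2(r)[:n_cols] for r in kept]
    return [c for r in block for c in r]

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

if __name__ == "__main__":
    # Toy 12-residue "protein" with 6-dim embeddings (stand-in for ESM-2 output).
    protein = [[math.sin(0.3 * i * (j + 1)) for j in range(6)] for i in range(12)]
    # Hypothetical domain boundaries (half-open residue ranges).
    domains = [(0, 5), (5, 12)]
    fingerprints = [dct_fingerprint(protein[s:e]) for s, e in domains]
    print(len(mean_pool(protein)), [len(f) for f in fingerprints])
    print(cosine(fingerprints[0], fingerprints[1]))
```

Because only a fixed block of low-frequency coefficients is kept, domains of different lengths map to vectors of the same size, which is what makes a direct vector comparison such as cosine similarity possible in this sketch.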
Pages: 421-424
Page count: 4