Protein Domain Embeddings for Fast and Accurate Similarity Search

Authors
Iovino, Benjamin Giovanni [1]
Tang, Haixu [1]
Ye, Yuzhen [1]
Affiliation
[1] Indiana Univ, Luddy Sch Informat Comp & Engn, 700 N Woodlawn Ave, Bloomington, IN 47408 USA
Source
RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY, RECOMB 2024 | 2024 / Vol. 14758
Keywords
protein language model (PLM); ESM-2; Domain segmentation; Recursive Cut (RecCut); Discrete Cosine Transformation (DCT); DCT fingerprint; Homology detection; LANGUAGE;
DOI
10.1007/978-1-0716-3989-4_44
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202;
Abstract
Recently developed protein language models have enabled a variety of applications of protein contextual embeddings. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues, or by applying matrix transformation techniques such as the discrete cosine transformation to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches for similar proteins; however, limitations have been found. For example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not for multi-domain proteins. Here we propose a novel approach that first segments proteins into domains and then applies the discrete cosine transformation to the vectorized embeddings of the residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses contact maps predicted by ESM-2 for domain segmentation; the segmentation problem is solved with a recursive cut algorithm (RecCut for short) in time quadratic in the protein length. We show that these domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but have undefined extended regions between shared domains, and between proteins that share only local similarities.
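The two ways of deriving a fixed-dimension vector mentioned in the abstract (averaging residue embeddings, and applying a discrete cosine transformation to the matrix of residue embeddings) can be illustrated with a minimal sketch. This is not the DCTdomain implementation: the function names, the unnormalized DCT-II, the zero-padding rule, and the `keep_rows`/`keep_cols` values are all assumptions chosen for illustration, and the toy matrices stand in for real ESM-2 embeddings.

```python
import math


def dct2(signal):
    """Unnormalized DCT-II of a 1-D sequence of floats."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def mean_pool(embeddings):
    """Average an L x D matrix of residue embeddings into one D-dim vector."""
    length = len(embeddings)
    return [sum(col) / length for col in zip(*embeddings)]


def dct_fingerprint(embeddings, keep_rows=3, keep_cols=4):
    """Fixed-length vector from the low-frequency block of a 2-D DCT.

    keep_rows and keep_cols are hypothetical settings, not the paper's.
    """
    # DCT along the residue axis: one transform per embedding dimension.
    cols = [dct2(list(col)) for col in zip(*embeddings)]
    rows = [list(r) for r in zip(*cols)]  # transpose back to L x D
    # DCT along the embedding axis: one transform per residue-frequency row.
    rows = [dct2(r) for r in rows]
    # Keep the top-left block, zero-padding when the matrix is smaller
    # than the block (an assumption made so short domains still work).
    fingerprint = []
    for i in range(keep_rows):
        for j in range(keep_cols):
            inside = i < len(rows) and j < len(rows[i])
            fingerprint.append(rows[i][j] if inside else 0.0)
    return fingerprint


# Two toy "domains" of different lengths (5 and 9 residues, 6 dims each).
domain_a = [[1.0] * 6 for _ in range(5)]
domain_b = [[float(i + j) for j in range(6)] for i in range(9)]
fp_a = dct_fingerprint(domain_a)
fp_b = dct_fingerprint(domain_b)
```

Both fingerprints have `keep_rows * keep_cols` entries regardless of domain length, which is what makes vectors from domains of different sizes directly comparable. Mean pooling also yields a fixed length but discards the ordering of residues along the sequence.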
Pages: 421 - 424
Page count: 4