Protein Domain Embeddings for Fast and Accurate Similarity Search

Authors
Iovino, Benjamin Giovanni [1 ]
Tang, Haixu [1 ]
Ye, Yuzhen [1 ]
Affiliations
[1] Indiana Univ, Luddy Sch Informat Comp & Engn, 700 N Woodlawn Ave, Bloomington, IN 47408 USA
Keywords
protein language model (PLM); ESM-2; Domain segmentation; Recursive Cut (RecCut); Discrete Cosine Transformation (DCT); DCT fingerprint; Homology detection; LANGUAGE;
DOI
10.1007/978-1-0716-3989-4_44
Chinese Library Classification
TP301 [Theory, Methods];
Discipline classification code
081202 ;
Abstract
Recently developed protein language models have enabled a variety of applications of contextual protein embeddings. Per-protein representations, in which each protein is represented as a vector of fixed dimension, can be derived by averaging the embeddings of individual residues or by applying matrix transformation techniques, such as the discrete cosine transformation (DCT), to matrices of residue embeddings. Such protein-level embeddings have been used for fast searches of similar proteins, but they have limitations: for example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not for multi-domain proteins. Here we propose a novel approach that first segments proteins into domains and then applies the DCT to the vectorized embeddings of the residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, uses contact maps predicted by ESM-2 for domain segmentation, which is solved with a recursive cut algorithm (RecCut for short) in time quadratic in the protein length. We show that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but have undefined extended regions between shared domains, and between proteins that share only local similarities.
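The idea of turning a variable-length matrix of residue embeddings into a fixed-length vector with a DCT can be illustrated with a short sketch. This is not the DCTdomain implementation. The function names, the `n_rows` and `n_cols` values, and the use of cosine similarity for comparison are assumptions made for illustration, since the abstract states none of these details. The sketch uses only the Python standard library. In practice an array library would be used, and the input would come from a protein language model rather than toy numbers.

```python
import math


def dct2_1d(x):
    """Orthonormal DCT-II of a 1-D sequence of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def dct_fingerprint(embeddings, n_rows=3, n_cols=4):
    """Fixed-length vector from an L x D embedding matrix (hypothetical helper).

    embeddings: list of L rows (residues), each a list of D floats.
    n_rows, n_cols: illustrative sizes of the low-frequency block to keep.
    """
    length, dim = len(embeddings), len(embeddings[0])
    # DCT along the residue axis, one embedding dimension at a time.
    cols = [dct2_1d([embeddings[i][j] for i in range(length)]) for j in range(dim)]
    # DCT along the embedding axis, only for the residue frequencies we keep.
    kept = [dct2_1d([cols[j][k] for j in range(dim)])
            for k in range(min(n_rows, length))]
    # Flatten the low-frequency block; zero-pad when the input is too small.
    fingerprint = []
    for k in range(n_rows):
        for m in range(n_cols):
            fingerprint.append(kept[k][m] if k < len(kept) and m < dim else 0.0)
    return fingerprint


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


# Toy "domains" of different lengths map to vectors of the same size.
short_domain = [[float(i + j) for j in range(6)] for i in range(5)]
long_domain = [[float(i * j % 7) for j in range(6)] for i in range(9)]
fp_short = dct_fingerprint(short_domain)
fp_long = dct_fingerprint(long_domain)
score = cosine_similarity(fp_short, fp_long)
```

Keeping only the low-frequency block is what makes the output length independent of the number of residues, so segments of different lengths can be compared directly.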
Pages: 421-424
Page count: 4