Protein language model (PLM);
ESM-2;
Domain segmentation;
Recursive Cut (RecCut);
Discrete Cosine Transformation (DCT);
DCT fingerprint;
Homology detection;
LANGUAGE;
DOI:
10.1007/978-1-0716-3989-4_44
Chinese Library Classification number:
TP301 [Theory, Methods];
Discipline classification code:
081202;
Abstract:
Recently developed protein language models have enabled a variety of applications of protein contextual embeddings. Per-protein representations (each protein is represented as a vector of fixed dimension) can be derived by averaging the embeddings of individual residues, or by applying matrix transformation techniques such as the discrete cosine transformation to matrices of residue embeddings. Such protein-level embeddings have been applied to enable fast searches for similar proteins; however, limitations have been found. For example, PROST is good at detecting global homologs but not local homologs, and knnProtT5 excels for single-domain proteins but not for multi-domain proteins. Here we propose a novel approach that first segments proteins into domains and then applies the discrete cosine transformation to the vectorized embeddings of residues in each domain to infer domain-level contextual vectors. Our approach, called DCTdomain, utilizes predicted contact maps from ESM-2 for domain segmentation, which is formulated as a domain segmentation problem and can be solved using a recursive cut algorithm (RecCut for short) in time quadratic in the protein length. We showed that such domain-level contextual vectors (termed DCT fingerprints) enable fast and accurate detection of similarity between proteins that share global similarities but with undefined extended regions between shared domains, and between proteins that share only local similarities.
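The two per-protein representations mentioned in the abstract (averaging residue embeddings, and keeping low-frequency discrete cosine transform coefficients of the residue embedding matrix) can be illustrated with a minimal, standard-library-only sketch. This is not the DCTdomain implementation. The function names, the unnormalized DCT-II, the 3 x 2 coefficient block, and the `domains` boundary list are all assumptions made for illustration. In practice the embeddings would come from a model such as ESM-2 and the boundaries from a segmentation step such as RecCut.

```python
import math


def dct2(signal, n_coeffs):
    """Return the first n_coeffs coefficients of the unnormalized DCT-II."""
    n = len(signal)
    return [
        sum(x * math.cos(math.pi * (i + 0.5) * k / n) for i, x in enumerate(signal))
        for k in range(n_coeffs)
    ]


def mean_pool(embeddings):
    """Average an L x D matrix of residue embeddings into one D-dim vector."""
    length = len(embeddings)
    return [sum(column) / length for column in zip(*embeddings)]


def dct_fingerprint(embeddings, n_rows=3, n_cols=2):
    """Flattened low-frequency n_rows x n_cols block of the 2-D DCT.

    The block size is an illustrative assumption, not a published setting.
    The output length is n_rows * n_cols regardless of sequence length.
    """
    # DCT along the residue axis, one embedding dimension at a time.
    along_residues = [dct2(list(column), n_rows) for column in zip(*embeddings)]
    # Transpose to n_rows rows of length D, then DCT along the embedding axis.
    block = [dct2(list(row), n_cols) for row in zip(*along_residues)]
    return [value for row in block for value in row]


def domain_fingerprints(embeddings, domains, n_rows=3, n_cols=2):
    """One fingerprint per domain; `domains` holds hypothetical
    (start, end) residue index pairs, end exclusive."""
    return [
        dct_fingerprint(embeddings[start:end], n_rows, n_cols)
        for start, end in domains
    ]


if __name__ == "__main__":
    # Toy 6-residue protein with 4-dimensional embeddings (made-up numbers).
    toy = [[0.1 * (i + j) for j in range(4)] for i in range(6)]
    print(len(mean_pool(toy)))
    print([len(fp) for fp in domain_fingerprints(toy, [(0, 3), (3, 6)])])
```

Because only a fixed number of low-frequency coefficients is kept, proteins or domains of different lengths map to vectors of the same dimension, which is what makes fast vector-to-vector comparison possible.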
Affiliation:
Indiana Univ, Sch Informat & Comp, Bloomington, IN 47408 USA
Ye, Yuzhen
Choi, Jeong-Hyeon
Affiliation:
Indiana Univ, Ctr Genom & Bioinformat, Bloomington, IN 47405 USA
Indiana Univ, Sch Informat & Comp, Bloomington, IN 47408 USA
Choi, Jeong-Hyeon
Tang, Haixu
Affiliation:
Indiana Univ, Sch Informat & Comp, Bloomington, IN 47408 USA
Indiana Univ, Ctr Genom & Bioinformat, Bloomington, IN 47405 USA