Unique function words characterize genomic proteins

Cited by: 5
Authors
Scaiewicz, Andrea [1 ]
Levitt, Michael [1 ]
Affiliations
[1] Stanford Univ, Sch Med, Dept Struct Biol, Stanford, CA 94305 USA
Funding
National Science Foundation (USA);
Keywords
protein universe; genomic sequences; functional profiles; domain architecture; shared function; EVOLUTION; SUPERFAMILIES; HOMOLOGY; DATABASE; UNIVERSE; IMPACT;
DOI
10.1073/pnas.1801182115
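Chinese Library Classification (CLC) number
O [Mathematical Sciences and Chemistry]; P [Astronomy and Earth Sciences]; Q [Biological Sciences]; N [General Natural Sciences];
Discipline classification code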
07 ; 0710 ; 09 ;
Abstract
Between 2009 and 2016 the number of protein sequences from known species increased 10-fold, from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the Conserved Domain Architecture Retrieval Tool (CDART) as a sequence motif. Motifs provide clues to biological function, but CDART often matches the same region of a protein with two or more profiles. Such synonyms complicate estimates of functional complexity. We perform full-linkage clustering of redundant profiles by finding maximum disjoint cliques: each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly, by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of "words" or UFWs (57% shared), the "sentences" (MDAs) are different (1.3% shared).
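The clustering step named in the abstract (full-linkage clustering via disjoint cliques) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not say how synonymy between profiles is scored, how ties are broken, or how the representative is chosen. The sketch assumes that a list of synonym pairs is already given, that "maximum disjoint cliques" means repeatedly extracting the largest remaining clique, and that the representative is simply the alphabetically first profile name. The function names `maximal_cliques` and `disjoint_clique_clusters` and the profile names are hypothetical.

```python
def maximal_cliques(adj, nodes):
    """Yield every maximal clique of the subgraph induced by `nodes`
    (Bron-Kerbosch with pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield r
            return
        # Pivot on the vertex with the most neighbours still in p.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(nodes), set())


def disjoint_clique_clusters(profiles, synonym_pairs):
    """Greedily partition profiles into disjoint cliques, largest first.

    Assumption: two profiles are 'synonyms' if they appear as a pair in
    `synonym_pairs`. Every pair inside a cluster must be synonymous
    (full linkage), so each cluster is a clique of the synonym graph.
    Returns {representative: sorted members}; the representative is the
    alphabetically first member (an illustrative choice only).
    """
    adj = {p: set() for p in profiles}
    for a, b in synonym_pairs:
        adj[a].add(b)
        adj[b].add(a)

    remaining = set(profiles)
    clusters = {}
    while remaining:
        cliques = list(maximal_cliques(adj, remaining))
        # Largest clique first; ties broken by sorted member names.
        best = min(cliques, key=lambda c: (-len(c), sorted(c)))
        members = sorted(best)
        clusters[members[0]] = members
        remaining -= best
    return clusters


# Toy input: A, B, C are mutually synonymous; D overlaps only with C.
profiles = ["A", "B", "C", "D", "E"]
pairs = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")]
clusters = disjoint_clique_clusters(profiles, pairs)
print(clusters)
```

In the toy input, D is linked to C but not to A or B, so full linkage keeps D out of the A/B/C cluster; a single-linkage scheme would have merged all four. Each key of the returned mapping plays the role of one UFW under these assumptions.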
Pages: 6703 - 6708
Page count: 6
Related papers
50 items in total
  • [21] The counting principle makes number words unique
    Ariel, Mira
    Levshina, Natalia
    CORPUS LINGUISTICS AND LINGUISTIC THEORY, 2025, 21 (01) : 173 - 199
  • [22] Partial words with a unique position starting a square
    Machacek, John
    INFORMATION PROCESSING LETTERS, 2019, 145 : 44 - 47
  • [23] Representing bacteria with unique genomic signatures
    Pham, Diem-Trang
    Phan, Vinhthuy
    FRONTIERS IN BIG DATA, 2022, 5
  • [24] Cofactor fingerprinting with STD NMR to characterize proteins of unknown function: identification of a rare cCMP cofactor preference
    Yao, HL
    Sem, DS
    FEBS LETTERS, 2005, 579 (03) : 661 - 666
  • [25] Cofactor fingerprinting with STD NMR to characterize proteins of unknown function: identification of a rare cCMP cofactor preference
    Yao, HL
    Sem, DS
    FASEB JOURNAL, 2005, 19 (04) : A851 - A852
  • [26] Function words in Polish
    Levin-Steinmann, A
    ZEITSCHRIFT FUR SLAVISCHE PHILOLOGIE, 2001, 60 (01) : 223 - 236
  • [27] THE FUNCTION OF WORDS IN 'CANDIDE'
    GILOT, M
    LITTERATURES, 1984, (9-10) : 91 - 97
  • [28] Identification, genomic organization and mRNA expression of CRELD1, the founding member of a unique family of matricellular proteins
    Rupp, PA
    Fouad, GT
    Egelston, CA
    Reifsteck, CA
    Olson, SB
    Knosp, WM
    Glanville, RW
    Thornburg, KL
    Robinson, SW
    Maslen, CL
    GENE, 2002, 293 (1-2) : 47 - 57
  • [29] UNIQUE METABOLIC PROFILES CHARACTERIZE ADULTS WITH NAFLD AND CARDIOVASCULAR DISEASE
    Corey, Kathleen E.
    Osganian, Stephanie
    Zheng, Hui
    Morningstar, Jordan
    Costentin, Charlotte E. Laurent
    Simon, Tracey G.
    Masia, Ricard
    Gerszten, Robert
    HEPATOLOGY, 2019, 70 : 718A - 718A
  • [30] HELANAL: A program to characterize helix geometry in proteins
    Bansal, M
    Kumar, S
    Velavan, R
    JOURNAL OF BIOMOLECULAR STRUCTURE & DYNAMICS, 2000, 17 (05) : 811 - 819