Unique function words characterize genomic proteins

Cited by: 5
Authors
Scaiewicz, Andrea [1]
Levitt, Michael [1]
Affiliations
[1] Stanford Univ, Sch Med, Dept Struct Biol, Stanford, CA 94305 USA
Funding
U.S. National Science Foundation
Keywords
protein universe; genomic sequences; functional profiles; domain architecture; shared function; EVOLUTION; SUPERFAMILIES; HOMOLOGY; DATABASE; UNIVERSE; IMPACT;
DOI
10.1073/pnas.1801182115
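Chinese Library Classification (CLC) number
O [Mathematical sciences and chemistry]; P [Astronomy and earth sciences]; Q [Biological sciences]; N [General natural sciences]
Discipline classification code
07; 0710; 09
Abstract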
Between 2009 and 2016 the number of protein sequences from known species increased 10-fold from 8 million to 85 million. About 80% of these sequences contain at least one region recognized by the conserved domain architecture retrieval tool (CDART) as a sequence motif. Motifs provide clues to biological function but CDART often matches the same region of a protein by two or more profiles. Such synonyms complicate estimates of functional complexity. We do full-linkage clustering of redundant profiles by finding maximum disjoint cliques: Each cluster is replaced by a single representative profile to give what we term a unique function word (UFW). From 2009 to 2016, the number of sequence profiles used by CDART increased by 80%; the number of UFWs increased more slowly by 30%, indicating that the number of UFWs may be saturating. The number of sequences matched by a single UFW (sequences with single domain architectures) increased as slowly as the number of different words, whereas the number of sequences matched by a combination of two or more UFWs in sequences with multiple domain architectures (MDAs) increased at the same rate as the total number of sequences. This combinatorial arrangement of a limited number of UFWs in MDAs accounts for the genomic diversity of protein sequences. Although eukaryotes and prokaryotes use very similar sets of "words" or UFWs (57% shared), the "sentences" (MDAs) are different (1.3% shared).
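The clustering step named in the abstract (full-linkage clustering of redundant profiles via disjoint cliques) can be illustrated with a small sketch. This is not the authors' implementation. It assumes that "synonym" relations between profiles are already available as pairs, it uses a greedy strategy (repeatedly take the largest remaining clique and remove its members), and it picks the alphabetically first member as the representative. The function names `maximal_cliques` and `disjoint_clique_clusters` are hypothetical, as is the representative rule.

```python
def maximal_cliques(adj, nodes):
    """Yield all maximal cliques among `nodes` (Bron-Kerbosch, no pivoting)."""
    def expand(r, p, x):
        if not p and not x:
            yield set(r)
            return
        for v in list(p):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(nodes), set())


def disjoint_clique_clusters(profiles, synonym_pairs):
    """Greedily partition profiles into disjoint cliques of mutual synonyms.

    Assumption: two profiles are 'synonyms' if they match the same protein
    region; how that is decided is outside this sketch.
    """
    adj = {p: set() for p in profiles}
    for a, b in synonym_pairs:
        adj[a].add(b)
        adj[b].add(a)

    remaining = set(profiles)
    clusters = []
    while remaining:
        # Restrict the graph to profiles not yet assigned to a cluster.
        sub = {v: adj[v] & remaining for v in remaining}
        # Largest clique first; ties broken by sorted names for determinism.
        best = min(maximal_cliques(sub, remaining),
                   key=lambda c: (-len(c), sorted(c)))
        clusters.append(sorted(best))
        remaining -= best
    return clusters


if __name__ == "__main__":
    clusters = disjoint_clique_clusters(
        ["A", "B", "C", "D", "E"],
        [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D")],
    )
    # Hypothetical rule: the first member of each cluster is its representative.
    representatives = [c[0] for c in clusters]
    print(clusters, representatives)
```

In this toy input, D is a synonym of C but not of A or B, so full linkage keeps D out of the A-B-C cluster: every member of a cluster must be a synonym of every other member.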
Pages: 6703-6708
Page count: 6