Identification and distribution of protein families in 120 completed genomes using Gene3D

Cited by: 23
Authors
Lee, D [1]
Grant, A [1]
Marsden, RL [1]
Orengo, C [1]
Affiliations
[1] UCL, Dept Biochem, Biomol Struct & Modeling Grp, London WC1E 6BT, England
Keywords
protein domain architecture; domain partnerships; structural genomics
DOI
10.1002/prot.20409
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010 ; 081704 ;
Abstract
Using a new protocol, PFscape, we undertake a systematic identification of protein families and domain architectures in 120 complete genomes. PFscape clusters sequences into protein families using a Markov clustering algorithm (Enright et al., Nucleic Acids Res 2002;30:1575-1584) followed by complete linkage clustering according to sequence identity. Within each protein family, domains are recognized using a library of hidden Markov models comprising CATH structural and Pfam functional domains. Domain architectures are then determined using DomainFinder (Pearl et al., Protein Sci 2002;11:233-244) and the protein family and domain architecture data are amalgamated in the Gene3D database (Buchan et al., Genome Res 2002;12:503-514). Using Gene3D, we have investigated protein sequence space, the extent of structural annotation, and the distribution of different domain architectures in completed genomes from all kingdoms of life. As with earlier studies by other researchers, the distribution of domain families shows power-law behavior such that the largest 2,000 domain families can be mapped to ~70% of nonsingleton genome sequences; the remaining sequences are assigned to much smaller families. While ~50% of domain annotations within a genome are assigned to 219 universal domain families, a much smaller proportion (<10%) of protein sequences are assigned to universal protein families. This supports the mosaic theory of evolution whereby domain duplication followed by domain shuffling gives rise to novel domain architectures that can expand the protein functional repertoire of an organism. Functional data (e.g., COG/KEGG/GO) integrated within Gene3D result in a comprehensive resource that is currently being used in structure genomics initiatives and can be accessed via http://www.biochem.ucl.ac.uk/bsm/cath/Gene3D/. © 2005 Wiley-Liss, Inc.
Pages: 603-615
Page count: 13