Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments

Cited by: 3
Authors
Neuwald, Andrew F. [1 ,2 ]
Lanczycki, Christopher J. [3]
Hodges, Theresa K. [1 ]
Marchler-Bauer, Aron [3 ]
Affiliations
[1] Univ Maryland, Sch Med, Inst Genome Sci, 670 W Baltimore St, Baltimore, MD 21201 USA
[2] Univ Maryland, Sch Med, Dept Biochem & Mol Biol, 670 W Baltimore St, Baltimore, MD 21201 USA
[3] NLM, Natl Ctr Biotechnol Informat, NIH, Bldg 38 A,8600 Rockville Pike, Bethesda, MD 20894 USA
Keywords
STRUCTURE PREDICTION; ERRORS
DOI
10.1093/database/baaa042
Chinese Library Classification (CLC) number
Q [Biological sciences]
Subject classification codes
07; 0710; 09
Abstract
For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect, or may tend to misalign, sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily, along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, the auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA, and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
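The hierarchical arrangement described in the abstract can be pictured with a small sketch. This is an illustration only, not the MAPGAPS file format or algorithm: the `Node` class, the `to_parent` column map (a stand-in for a template MSA) and the functions `lift` and `project_to_root` are hypothetical names introduced here. The sketch assumes each subgroup alignment records, per column, which column of its parent alignment it corresponds to, so that a row aligned within a subgroup can be carried up to the root coordinate system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    """One subgroup alignment in a hypothetical hierarchy (not the MAPGAPS format)."""
    name: str
    width: int                                     # number of columns in this alignment
    parent: Optional["Node"] = None
    # For each column of this node: index of the matching parent column,
    # or None if the column is an insertion with no counterpart in the parent.
    to_parent: Optional[List[Optional[int]]] = None


def lift(row: str, col_map: List[Optional[int]], parent_width: int) -> str:
    """Re-express one aligned row in the parent's columns."""
    out = ["-"] * parent_width                     # unmatched parent columns stay gaps
    for residue, target in zip(row, col_map):
        if target is not None:                     # insertion columns are dropped
            out[target] = residue
    return "".join(out)


def project_to_root(node: Node, row: str) -> str:
    """Carry a row aligned at `node` up the hierarchy to the root alignment."""
    while node.parent is not None:
        row = lift(row, node.to_parent, node.parent.width)
        node = node.parent
    return row


# Toy hierarchy: superfamily (6 columns) > family (5 columns) > subfamily (3 columns).
root = Node("superfamily", width=6)
family = Node("family", width=5, parent=root, to_parent=[0, 1, None, 3, 4])
subfamily = Node("subfamily", width=3, parent=family, to_parent=[0, 2, 4])

family_row_at_root = project_to_root(family, "ACDEF")
subfamily_row_at_root = project_to_root(subfamily, "KLM")
```

In this toy setup, a sequence only needs to be aligned well against its closest subgroup; the column maps then place it in the superfamily-wide alignment, which mirrors the role the abstract assigns to the template MSAs.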
Pages: 8