Obtaining extremely large and accurate protein multiple sequence alignments from curated hierarchical alignments

Cited by: 3
Authors
Neuwald, Andrew F. [1, 2]
Lanczycki, Christopher J. [3]
Hodges, Theresa K. [1 ]
Marchler-Bauer, Aron [3 ]
Affiliations
[1] Univ Maryland, Sch Med, Inst Genome Sci, 670 W Baltimore St, Baltimore, MD 21201 USA
[2] Univ Maryland, Sch Med, Dept Biochem & Mol Biol, 670 W Baltimore St, Baltimore, MD 21201 USA
[3] NLM, Natl Ctr Biotechnol Informat, NIH, Bldg 38 A,8600 Rockville Pike, Bethesda, MD 20894 USA
Keywords
STRUCTURE PREDICTION; ERRORS
DOI
10.1093/database/baaa042
Chinese Library Classification (CLC) number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
For optimal performance, machine learning methods for protein sequence/structural analysis typically require as input a large multiple sequence alignment (MSA), which is often created using query-based iterative programs, such as PSI-BLAST or JackHMMER. However, because these programs align database sequences using a query sequence as a template, they may fail to detect or may tend to misalign sequences distantly related to the query. More generally, automated MSA programs often fail to align sequences correctly due to the unpredictable nature of protein evolution. Addressing this problem typically requires manual curation in the light of structural data. However, curated MSAs tend to contain too few sequences to serve as input for statistically based methods. We address these shortcomings by making publicly available a set of 252 curated hierarchical MSAs (hiMSAs), containing a total of 26 212 066 sequences, along with programs for generating extremely large MSAs from them. Each hiMSA consists of a set of hierarchically arranged MSAs representing individual subgroups within a superfamily along with template MSAs specifying how to align each subgroup MSA against MSAs higher up the hierarchy. Central to this approach is the MAPGAPS search program, which uses a hiMSA as a query to align (potentially vast numbers of) matching database sequences with accuracy comparable to that of the curated hiMSA. We illustrate this process for the exonuclease-endonuclease-phosphatase superfamily and for pleckstrin homology domains. A set of extremely large MSAs generated from the hiMSAs in this way is available as input for deep learning and big data analyses. MAPGAPS, auxiliary programs CDD2MGS, AddPhylum, PurgeMSA and ConvertMSA and links to National Center for Biotechnology Information data files are available at https://www.igs.umaryland.edu/labs/neuwald/software/mapgaps/.
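The hierarchical arrangement described in the abstract can be illustrated with a small sketch. It is not the MAPGAPS file format or algorithm: the `Node` class, the `to_parent` column map (standing in for a template MSA) and the `lift` function are hypothetical names introduced here, and the toy hierarchy is invented for illustration. The sketch only shows the general idea that a row aligned within a subgroup can be carried up the hierarchy, level by level, into the column coordinates of the top-level alignment.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """One subgroup MSA in a hypothetical hierarchy (not the MAPGAPS format)."""
    name: str
    width: int                      # number of aligned columns in this MSA
    parent: Optional[Node] = None
    # to_parent[i] is the parent column aligned with column i, or None when
    # column i is an insertion that the parent alignment does not model.
    # This list is an assumed stand-in for a template MSA.
    to_parent: Optional[list[Optional[int]]] = None


def lift(row: str, node: Node) -> str:
    """Re-express a row aligned to `node` in the root's column coordinates."""
    if len(row) != node.width:
        raise ValueError("row length must equal the node's column count")
    while node.parent is not None:
        lifted = ["-"] * node.parent.width      # start from an all-gap row
        for col, residue in enumerate(row):
            target = node.to_parent[col]
            if target is not None:              # drop subgroup-only insertions
                lifted[target] = residue
        row, node = "".join(lifted), node.parent
    return row


# Toy three-level hierarchy: superfamily > family > subfamily.
root = Node("superfamily", width=6)
family = Node("family", width=5, parent=root, to_parent=[0, 1, 2, 4, 5])
subfamily = Node("subfamily", width=4, parent=family, to_parent=[0, None, 2, 3])

print(lift("MKLV", subfamily))
```

In this toy setup the second subfamily column is treated as an insertion and is dropped on the way up, while columns the subgroup lacks are filled with gaps; a real implementation would need to retain inserted residues rather than discard them.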
Pages: 8