RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation

Cited by: 535
Authors
Li, Wenjun [1]
O'Neill, Kathleen R. [1]
Haft, Daniel H. [1]
DiCuccio, Michael [1]
Chetvernin, Vyacheslav [1]
Badretdin, Azat [1]
Coulouris, George [1]
Chitsaz, Farideh [1]
Derbyshire, Myra K. [1]
Durkin, A. Scott [1]
Gonzales, Noreen R. [1]
Gwadz, Marc [1]
Lanczycki, Christopher J. [1]
Song, James S. [1]
Thanki, Narmada [1]
Wang, Jiyao [1]
Yamashita, Roxanne A. [1]
Yang, Mingzhang [1]
Zheng, Chanjuan [1]
Marchler-Bauer, Aron [1]
Thibaud-Nissen, Francoise [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, 45 Ctr Dr, Bethesda, MD 20892 USA
Funding
National Institutes of Health (USA)
DOI
10.1093/nar/gkaa1105
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains nearly 200 000 bacterial and archaeal genomes and 150 million proteins with up-to-date annotation. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP) since 2018 have resulted in a substantial reduction in spurious annotation. The hierarchical collection of protein family models (PFMs) used by PGAP as evidence for structural and functional annotation was expanded to over 35 000 protein profile hidden Markov models (HMMs), 12 300 BlastRules and 36 000 curated CDD architectures. As a result, >122 million or 79% of RefSeq proteins are now named based on a match to a curated PFM. Gene symbols, Enzyme Commission numbers or supporting publication attributes are available on over 40% of the PFMs and are inherited by the proteins and features they name, facilitating multi-genome analyses and connections to the literature. In adherence with the principles of FAIR (findable, accessible, interoperable, reusable), the PFMs are available in the Protein Family Models Entrez database to any user. Finally, the reference and representative genome set, a taxonomically diverse subset of RefSeq prokaryotic genomes, is now recalculated regularly and available for download and homology searches with BLAST. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/.
Pages: D1020-D1028
Number of pages: 9