RefSeq and the prokaryotic genome annotation pipeline in the age of metagenomes

Cited by: 12
Authors
Haft, Daniel H. [1]
Badretdin, Azat [1]
Coulouris, George [1]
DiCuccio, Michael [1]
Durkin, A. Scott [1]
Jovenitti, Eric [1]
Li, Wenjun [1]
Mersha, Megdelawit [1]
O'Neill, Kathleen R. [1]
Virothaisakun, Joel [1]
Thibaud-Nissen, Francoise [1]
Affiliation
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bethesda, MD 20894 USA
DOI
10.1093/nar/gkad988
Chinese Library Classification
Q5 [Biochemistry]; Q7 [Molecular Biology]
Subject classification codes
071010; 081704
Abstract
The Reference Sequence (RefSeq) project at the National Center for Biotechnology Information (NCBI) contains over 315 000 bacterial and archaeal genomes and 236 million proteins with up-to-date and consistent annotation. In the past 3 years, we have expanded the diversity of the RefSeq collection by including the best-quality metagenome-assembled genomes (MAGs) submitted to INSDC (DDBJ, ENA and GenBank), while maintaining its quality by adding validation checks. Assemblies are now more stringently evaluated for contamination and for completeness of annotation prior to acceptance into RefSeq. MAGs now account for over 17 000 assemblies in RefSeq, split over 165 orders and 362 families. Changes in the Prokaryotic Genome Annotation Pipeline (PGAP), which is used to annotate nearly all RefSeq assemblies, include better detection of protein-coding genes. Nearly 83% of RefSeq proteins are now named by a curated Protein Family Model, a 4.7% increase over the past three years. In addition to literature citations, Enzyme Commission numbers, and gene symbols, Gene Ontology terms are now assigned to 48% of RefSeq proteins, allowing for easier multi-genome comparison. RefSeq is found at https://www.ncbi.nlm.nih.gov/refseq/. PGAP is available as a stand-alone tool able to produce GenBank-ready files at https://github.com/ncbi/pgap.
Pages: D762-D769
Page count: 8
Related papers
50 in total
  • [1] RefSeq: an update on prokaryotic genome annotation and curation
    Haft, Daniel H.
    DiCuccio, Michael
    Badretdin, Azat
    Brover, Vyacheslav
    Chetvernin, Vyacheslav
    O'Neill, Kathleen
    Li, Wenjun
    Chitsaz, Farideh
    Derbyshire, Myra K.
    Gonzales, Noreen R.
    Gwadz, Marc
    Lu, Fu
    Marchler, Gabriele H.
    Song, James S.
    Thanki, Narmada
    Yamashita, Roxanne A.
    Zheng, Chanjuan
    Thibaud-Nissen, Francoise
    Geer, Lewis Y.
    Marchler-Bauer, Aron
    Pruitt, Kim D.
    [J]. NUCLEIC ACIDS RESEARCH, 2018, 46 (D1) : D851 - D860
  • [2] RefSeq: expanding the Prokaryotic Genome Annotation Pipeline reach with protein family model curation
    Li, Wenjun
    O'Neill, Kathleen R.
    Haft, Daniel H.
    DiCuccio, Michael
    Chetvernin, Vyacheslav
    Badretdin, Azat
    Coulouris, George
    Chitsaz, Farideh
    Derbyshire, Myra K.
    Durkin, A. Scott
    Gonzales, Noreen R.
    Gwadz, Marc
    Lanczycki, Christopher J.
    Song, James S.
    Thanki, Narmada
    Wang, Jiyao
    Yamashita, Roxanne A.
    Yang, Mingzhang
    Zheng, Chanjuan
    Marchler-Bauer, Aron
    Thibaud-Nissen, Francoise
    [J]. NUCLEIC ACIDS RESEARCH, 2021, 49 (D1) : D1020 - D1028
  • [3] NCBI prokaryotic genome annotation pipeline
    Tatusova, Tatiana
    DiCuccio, Michael
    Badretdin, Azat
    Chetvernin, Vyacheslav
    Nawrocki, Eric P.
    Zaslavsky, Leonid
    Lomsadze, Alexandre
    Pruitt, Kim D.
    Borodovsky, Mark
    Ostell, James
    [J]. NUCLEIC ACIDS RESEARCH, 2016, 44 (14) : 6614 - 6624
  • [4] Pannopi: prokaryotic genome assembly and annotation pipeline
    Zilov, Danil S.
    Komissarov, Aleksey S.
    [J]. BMC BIOINFORMATICS, 2020, 21 (SUPPL 20):
  • [5] Prokaryotic Contig Annotation Pipeline Server: Web Application for a Prokaryotic Genome Annotation Pipeline Based on the Shiny App Package
    Park, Byeonghyeok
    Baek, Min-Jeong
    Min, Byoungnam
    Choi, In-Geol
    [J]. JOURNAL OF COMPUTATIONAL BIOLOGY, 2017, 24 (09) : 917 - 922
  • [6] Mouse genome annotation by the RefSeq project
    Kelly M. McGarvey
    Tamara Goldfarb
    Eric Cox
    Catherine M. Farrell
    Tripti Gupta
    Vinita S. Joardar
    Vamsi K. Kodali
    Michael R. Murphy
    Nuala A. O’Leary
    Shashikant Pujar
    Bhanu Rajput
    Sanjida H. Rangwala
    Lillian D. Riddick
    David Webb
    Mathew W. Wright
    Terence D. Murphy
    Kim D. Pruitt
    [J]. Mammalian Genome, 2015, 26 : 379 - 390
  • [7] Mouse genome annotation by the RefSeq project
    McGarvey, Kelly M.
    Goldfarb, Tamara
    Cox, Eric
    Farrell, Catherine M.
    Gupta, Tripti
    Joardar, Vinita S.
    Kodali, Vamsi K.
    Murphy, Michael R.
    O'Leary, Nuala A.
    Pujar, Shashikant
    Rajput, Bhanu
    Rangwala, Sanjida H.
    Riddick, Lillian D.
    Webb, David
    Wright, Mathew W.
    Murphy, Terence D.
    Pruitt, Kim D.
    [J]. MAMMALIAN GENOME, 2015, 26 (9-10) : 379 - 390
  • [8] DFAST: a flexible prokaryotic genome annotation pipeline for faster genome publication
    Tanizawa, Yasuhiro
    Fujisawa, Takatomo
    Nakamura, Yasukazu
    [J]. BIOINFORMATICS, 2018, 34 (06) : 1037 - 1039
  • [9] MyPro: A seamless pipeline for automated prokaryotic genome assembly and annotation
    Liao, Yu-Chieh
    Lin, Hsin-Hung
    Sabharwal, Amarpreet
    Haase, Elaine M.
    Scannapieco, Frank A.
    [J]. JOURNAL OF MICROBIOLOGICAL METHODS, 2015, 113 : 72 - 74
  • [10] Prokka: rapid prokaryotic genome annotation
    Seemann, Torsten
    [J]. BIOINFORMATICS, 2014, 30 (14) : 2068 - 2069