Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation

Cited by: 3552
Authors
O'Leary, Nuala A. [1]
Wright, Mathew W. [1]
Brister, J. Rodney [1]
Ciufo, Stacy [1]
Haddad, Diana [1]
McVeigh, Rich [1]
Rajput, Bhanu [1]
Robbertse, Barbara [1]
Smith-White, Brian [1]
Ako-Adjei, Danso [1]
Astashyn, Alexander [1]
Badretdin, Azat [1]
Bao, Yiming [1]
Blinkova, Olga [1]
Brover, Vyacheslav [1]
Chetvernin, Vyacheslav [1]
Choi, Jinna [1]
Cox, Eric [1]
Ermolaeva, Olga [1]
Farrell, Catherine M. [1]
Goldfarb, Tamara [1]
Gupta, Tripti [1]
Haft, Daniel [1]
Hatcher, Eneida [1]
Hlavina, Wratko [1]
Joardar, Vinita S. [1]
Kodali, Vamsi K. [1]
Li, Wenjun [1]
Maglott, Donna [1]
Masterson, Patrick [1]
McGarvey, Kelly M. [1]
Murphy, Michael R. [1]
O'Neill, Kathleen [1]
Pujar, Shashikant [1]
Rangwala, Sanjida H. [1]
Rausch, Daniel [1]
Riddick, Lillian D. [1]
Schoch, Conrad [1]
Shkeda, Andrei [1]
Storz, Susan S. [1]
Sun, Hanzhen [1]
Thibaud-Nissen, Francoise [1]
Tolstoy, Igor [1]
Tully, Raymond E. [1]
Vatsan, Anjana R. [1]
Wallin, Craig [1]
Webb, David [1]
Wu, Wendy [1]
Landrum, Melissa J. [1]
Kimchi, Avi [1]
Tatusova, Tatiana [1]
Affiliations
[1] NIH, Natl Ctr Biotechnol Informat, Natl Lib Med, Bldg 38A, 8600 Rockville Pike, Bethesda, MD 20894 USA
Funding
US National Institutes of Health (NIH)
Keywords
GENOME ANNOTATION; MICROBIAL GENOMES; COMPARISON PASC; IDENTIFICATION; INSIGHTS; FAMILIES; FUTURE; TOOL
DOI
10.1093/nar/gkv1189
Chinese Library Classification (CLC) number
Q5 [Biochemistry]; Q7 [Molecular Biology]
Discipline classification code
071010; 081704
Abstract
The RefSeq project at the National Center for Biotechnology Information (NCBI) maintains and curates a publicly available database of annotated genomic, transcript, and protein sequence records (http://www.ncbi.nlm.nih.gov/refseq/). The RefSeq project leverages the data submitted to the International Nucleotide Sequence Database Collaboration (INSDC) against a combination of computation, manual curation, and collaboration to produce a standard set of stable, non-redundant reference sequences. The RefSeq project augments these reference sequences with current knowledge including publications, functional features and informative nomenclature. The database currently represents sequences from more than 55 000 organisms (>4800 viruses, >40 000 prokaryotes and >10 000 eukaryotes; RefSeq release 71), ranging from a single record to complete genomes. This paper summarizes the current status of the viral, prokaryotic, and eukaryotic branches of the RefSeq project, reports on improvements to data access and details efforts to further expand the taxonomic representation of the collection. We also highlight diverse functional curation initiatives that support multiple uses of RefSeq data including taxonomic validation, genome annotation, comparative genomics, and clinical testing. We summarize our approach to utilizing available RNA-Seq and other data types in our manual curation process for vertebrate, plant, and other species, and describe a new direction for prokaryotic genomes and protein name management.
Pages: D733-D745
Page count: 13