Variation benchmark datasets: update, criteria, quality and applications

Cited by: 28
Authors
Sarkar, Anasua [1]
Yang, Yang [2, 3]
Vihinen, Mauno [1]
Affiliations
[1] Lund Univ, Dept Expt Med Sci, BMC B13, SE-22184 Lund, Sweden
[2] Soochow Univ, Sch Comp Sci & Technol, 1 Shizi St, Suzhou 215006, Jiangsu, Peoples R China
[3] Soochow Univ, Prov Key Lab Comp Informat Proc Technol, 1 Shizi St, Suzhou 215006, Jiangsu, Peoples R China
Funding
Swedish Research Council; National Natural Science Foundation of China
Keywords
AMINO-ACID SUBSTITUTIONS; PREDICTING PROTEIN STABILITY; HUMAN-DISEASE GENES; COMPUTATIONAL TOOLS; MISSENSE VARIANTS; NUCLEOTIDE STRUCTURE; ACCURATE PREDICTION; MUTATION PATTERN; DATABASE; SEQUENCE;
DOI
10.1093/database/baz117
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification code
07; 0710; 09
Abstract
Development of new computational methods and testing of their performance have to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on the information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, and structure-mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein properties: aggregation, binding free energy, disorder and stability. There are also several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level, including relevant cross-references and variant descriptions. The updated VariBench facilitates the development and testing of new methods and the comparison of obtained performances to those of previously published methods.
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and show that such comparisons are feasible and useful; however, there may be limitations due to a lack of provided details and shared data.
Pages: 16