Variation benchmark datasets: update, criteria, quality and applications

Cited by: 28
Authors
Sarkar, Anasua [1]
Yang, Yang [2, 3]
Vihinen, Mauno [1]
Affiliations
[1] Lund Univ, Dept Expt Med Sci, BMC B13, SE-22184 Lund, Sweden
[2] Soochow Univ, Sch Comp Sci & Technol, 1 Shizi St, Suzhou 215006, Jiangsu, Peoples R China
[3] Soochow Univ, Prov Key Lab Comp Informat Proc Technol, 1 Shizi St, Suzhou 215006, Jiangsu, Peoples R China
Funding
Swedish Research Council; National Natural Science Foundation of China
Keywords
AMINO-ACID SUBSTITUTIONS; PREDICTING PROTEIN STABILITY; HUMAN-DISEASE GENES; COMPUTATIONAL TOOLS; MISSENSE VARIANTS; NUCLEOTIDE STRUCTURE; ACCURATE PREDICTION; MUTATION PATTERN; DATABASE; SEQUENCE;
DOI
10.1093/database/baz117
Chinese Library Classification
Q [Biological Sciences]
Subject classification codes
07; 0710; 09
Abstract
Development of new computational methods and testing of their performance have to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcomes are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on the information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, and structure-mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein properties for aggregation, binding free energy, disorder and stability. In addition, there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level, including relevant cross-references and variant descriptions. The updated VariBench facilitates the development and testing of new methods and the comparison of obtained performances to those of previously published methods.
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and show that such comparisons are feasible and useful; however, there may be limitations due to a lack of provided details and shared data.
Pages: 16