Variation benchmark datasets: update, criteria, quality and applications

Cited by: 28
Authors
Sarkar, Anasua [1]
Yang, Yang [2, 3]
Vihinen, Mauno [1]
Affiliations
[1] Lund Univ, Dept Expt Med Sci, BMC B13, SE-22184 Lund, Sweden
[2] Soochow Univ, Sch Comp Sci & Technol, 1 Shizi St, Suzhou 215006, Jiangsu, Peoples R China
[3] Soochow Univ, Prov Key Lab Comp Informat Proc Technol, 1 Shizi St, Suzhou 215006, Jiangsu, Peoples R China
Funding
Swedish Research Council; National Natural Science Foundation of China
Keywords
AMINO-ACID SUBSTITUTIONS; PREDICTING PROTEIN STABILITY; HUMAN-DISEASE GENES; COMPUTATIONAL TOOLS; MISSENSE VARIANTS; NUCLEOTIDE STRUCTURE; ACCURATE PREDICTION; MUTATION PATTERN; DATABASE; SEQUENCE;
DOI
10.1093/database/baz117
Chinese Library Classification number
Q [Biological Sciences]
Discipline classification codes
07; 0710; 09
Abstract
Development of new computational methods and testing their performance has to be carried out using experimental data. Only in comparison to existing knowledge can method performance be assessed. For that purpose, benchmark datasets with known and verified outcome are needed. High-quality benchmark datasets are valuable and may be difficult, laborious and time-consuming to generate. VariBench and VariSNP are the two existing databases for sharing variation benchmark datasets used mainly for variation interpretation. They have been used for training and benchmarking predictors for various types of variations and their effects. VariBench was updated with 419 new datasets from 109 papers containing altogether 329 014 152 variants; however, there is plenty of redundancy between the datasets. VariBench is freely available at http://structure.bmc.lu.se/VariBench/. The contents of the datasets vary depending on information in the original source. The available datasets have been categorized into 20 groups and subgroups. There are datasets for insertions and deletions, substitutions in coding and non-coding regions, structure mapped, synonymous and benign variants. Effect-specific datasets include DNA regulatory elements, RNA splicing, and protein property for aggregation, binding free energy, disorder and stability. Then there are several datasets for molecule-specific and disease-specific applications, as well as one dataset for variation phenotype effects. Variants are often described at three molecular levels (DNA, RNA and protein) and sometimes also at the protein structural level including relevant cross references and variant descriptions. The updated VariBench facilitates development and testing of new methods and comparison of obtained performances to previously published methods.
We compared the performance of the pathogenicity/tolerance predictor PON-P2 to several benchmark studies and show that such comparisons are feasible and useful; however, there may be limitations due to lack of provided details and shared data.
Pages: 16