MoleculeNet: a benchmark for molecular machine learning

Cited by: 1338
Authors
Wu, Zhenqin [1]
Ramsundar, Bharath [2]
Feinberg, Evan N. [3]
Gomes, Joseph [1]
Geniesse, Caleb [3]
Pappu, Aneesh S. [2]
Leswing, Karl [4]
Pande, Vijay [1]
Affiliations
[1] Stanford Univ, Dept Chem, Stanford, CA 94305 USA
[2] Stanford Univ, Dept Comp Sci, Stanford, CA 94305 USA
[3] Stanford Sch Med, Program Biophys, Stanford, CA 94305 USA
[4] Schrodinger Inc, New York, NY USA
Keywords
NEURAL-NETWORKS; AQUEOUS SOLUBILITY; PDBBIND DATABASE; FREE-ENERGIES; PREDICTION; CHEMOINFORMATICS; VALIDATION; COLLECTION; DRUGS
DOI
10.1039/c7sc02664a
Chinese Library Classification
O6 [Chemistry]
Discipline classification code
0703
Abstract
Molecular machine learning has been maturing rapidly over the last few years. Improved methods and the presence of larger datasets have enabled machine learning algorithms to make increasingly accurate predictions about molecular properties. However, algorithmic progress has been limited due to the lack of a standard benchmark to compare the efficacy of proposed methods; most new algorithms are benchmarked on different datasets making it challenging to gauge the quality of proposed methods. This work introduces MoleculeNet, a large scale benchmark for molecular machine learning. MoleculeNet curates multiple public datasets, establishes metrics for evaluation, and offers high quality open-source implementations of multiple previously proposed molecular featurization and learning algorithms (released as part of the DeepChem open source library). MoleculeNet benchmarks demonstrate that learnable representations are powerful tools for molecular machine learning and broadly offer the best performance. However, this result comes with caveats. Learnable representations still struggle to deal with complex tasks under data scarcity and highly imbalanced classification. For quantum mechanical and biophysical datasets, the use of physics-aware featurizations can be more important than choice of particular learning algorithm.
Pages: 513-530
Page count: 18