nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset

Cited by: 9
Authors
Khrabrov, Kuzma [1]
Shenbin, Ilya [3]
Ryabov, Alexander [4, 5]
Tsypin, Artem [1]
Telepov, Alexander [1]
Alekseev, Anton [3, 7]
Grishin, Alexander [1]
Strashnov, Pavel [1]
Zhilyaev, Petr [4]
Nikolenko, Sergey [3, 6]
Kadurin, Artur [1, 2]
Affiliations
[1] AIRI, Kutuzovskiy Prospect House 32 Bldg K1, Moscow 121170, Russia
[2] Kuban State Univ, Stavropolskaya St 149, Krasnodar 350040, Russia
[3] Russian Acad Sci, Steklov Math Inst, St Petersburg Dept, Nab R Fontanki 27, St Petersburg 191011, Russia
[4] Skolkovo Inst Sci & Technol, Ctr Mat Technol, Bolshoy Blvd 30,Bld 1, Moscow 121205, Russia
[5] Natl Res Univ, Moscow Inst Phys & Technol, Inst Sky Lane 9, Dolgoprudnyi 141700, Moscow Region, Russia
[6] ISP RAS Res Ctr Trusted Artificial Intelligence, Alexander Solzhenitsyn St 25, Moscow 109004, Russia
[7] St Petersburg Univ, 7-9 Univ Skaya Embankment, St Petersburg 199034, Russia
Keywords
CHEMICAL UNIVERSE; DENSITY FUNCTIONALS; VIRTUAL EXPLORATION; ACCURATE; SYSTEMS
DOI
10.1039/d2cp03966d
Chinese Library Classification
O64 [Physical chemistry (theoretical chemistry), chemical physics]
Discipline classification codes
070304; 081704
Abstract
Electronic wave function calculation is a fundamental task of computational quantum chemistry. Knowledge of the wave function parameters allows one to compute physical and chemical properties of molecules and materials. Unfortunately, it is infeasible to compute the wave functions analytically even for simple molecules. Classical quantum chemistry approaches such as the Hartree-Fock method or density functional theory (DFT) make it possible to compute an approximation of the wave function but are very computationally expensive. One way to lower the computational complexity is to use machine learning models that can provide sufficiently good approximations at a much lower computational cost. In this work we: (1) introduce a new curated large-scale dataset of electronic structures of drug-like molecules, (2) establish a novel benchmark for the estimation of molecular properties in the multi-molecule setting, and (3) evaluate a wide range of methods with this benchmark. We show that the accuracy of recently developed machine learning models deteriorates significantly when switching from the single-molecule to the multi-molecule setting. We also show that these models lack generalization over different chemistry classes. In addition, we provide experimental evidence that larger datasets lead to better ML models in the field of quantum chemistry.
Pages: 25853-25863
Page count: 11