nablaDFT: Large-Scale Conformational Energy and Hamiltonian Prediction benchmark and dataset

Cited by: 9
Authors
Khrabrov, Kuzma [1]
Shenbin, Ilya [3]
Ryabov, Alexander [4, 5]
Tsypin, Artem [1]
Telepov, Alexander [1]
Alekseev, Anton [3, 7]
Grishin, Alexander [1]
Strashnov, Pavel [1]
Zhilyaev, Petr [4]
Nikolenko, Sergey [3, 6]
Kadurin, Artur [1, 2]
Affiliations
[1] AIRI, Kutuzovskiy Prospect House 32 Bldg K1, Moscow 121170, Russia
[2] Kuban State Univ, Stavropolskaya St 149, Krasnodar 350040, Russia
[3] Russian Acad Sci, Steklov Math Inst, St Petersburg Dept, Nab R Fontanki 27, St Petersburg 191011, Russia
[4] Skolkovo Inst Sci & Technol, Ctr Mat Technol, Bolshoy Blvd 30,Bld 1, Moscow 121205, Russia
[5] Natl Res Univ, Moscow Inst Phys & Technol, Inst Sky Lane 9, Dolgoprudnyi 141700, Moscow Region, Russia
[6] ISP RAS Res Ctr Trusted Artificial Intelligence, Alexander Solzhenitsyn St 25, Moscow 109004, Russia
[7] St Petersburg Univ, 7-9 Univ Skaya Embankment, St Petersburg 199034, Russia
Keywords
CHEMICAL UNIVERSE; DENSITY FUNCTIONALS; VIRTUAL EXPLORATION; ACCURATE; SYSTEMS;
DOI
10.1039/d2cp03966d
Chinese Library Classification
O64 [Physical chemistry (theoretical chemistry), chemical physics];
Discipline classification code
070304 ; 081704 ;
Abstract
Electronic wave function calculation is a fundamental task of computational quantum chemistry. Knowledge of the wave function parameters allows one to compute physical and chemical properties of molecules and materials. Unfortunately, it is infeasible to compute the wave functions analytically even for simple molecules. Classical quantum chemistry approaches such as the Hartree-Fock method or density functional theory (DFT) make it possible to compute an approximation of the wave function but are computationally very expensive. One way to lower the computational complexity is to use machine learning models that can provide sufficiently good approximations at a much lower computational cost. In this work we: (1) introduce a new curated large-scale dataset of electron structures of drug-like molecules, (2) establish a novel benchmark for the estimation of molecular properties in the multi-molecule setting, and (3) evaluate a wide range of methods with this benchmark. We show that the accuracy of recently developed machine learning models deteriorates significantly when switching from the single-molecule to the multi-molecule setting. We also show that these models generalize poorly across different chemistry classes. In addition, we provide experimental evidence that larger datasets lead to better ML models in the field of quantum chemistry.
Pages: 25853-25863
Number of pages: 11