MolData, a molecular benchmark for disease and target based machine learning

Authors
Arash Keshavarzi Arshadi
Milad Salem
Arash Firouzbakht
Jiann Shiun Yuan
Affiliations
[1] University of Central Florida, Burnett School of Biomedical Sciences
[2] University of Central Florida, Department of Electrical and Computer Engineering
[3] University of Illinois at Urbana, Department of Chemistry
Keywords
Artificial intelligence; Benchmark; Biological assays; Big data; Database; Drug discovery; Machine learning; PubChem
DOI
Not available
Abstract
Deep learning’s automatic feature extraction has been a revolutionary addition to computational drug discovery, bringing the ability both to learn abstract features and to discover complex molecular patterns directly from molecular data. Since biological and chemical knowledge is necessary for overcoming the challenges of data curation, balancing, training, and evaluation, it is important for databases to contain information regarding the exact target and disease of each bioassay. Existing repositories such as PubChem and ChEMBL offer screening data for millions of molecules against a variety of cells and targets; however, their bioassays contain complex biological descriptions, which can hinder their use by the machine learning community. In this work, a comprehensive disease- and target-based dataset is collected from PubChem to facilitate and accelerate molecular machine learning for better drug discovery. MolData is one of the largest efforts to date for democratizing molecular machine learning, with roughly 170 million drug screening results from 1.4 million unique molecules assigned to specific diseases and targets. It also provides 30 unique categories of targets and diseases. Correlation analysis of the MolData bioassays unveils valuable information for drug repurposing across multiple diseases, including cancer, metabolic disorders, and infectious diseases. Finally, we provide a benchmark of more than 30 models trained on each category using multitask learning. MolData aims to pave the way for computational drug discovery and to accelerate the advancement of molecular artificial intelligence in a practical manner. The MolData benchmark data is available at https://GitHub.com/Transilico/MolData as well as within the additional files.
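The abstract mentions correlation analysis between bioassays but does not say how the correlation is computed. As a minimal sketch only, the snippet below assumes binary active/inactive labels and uses a plain Pearson correlation restricted to molecules screened in both assays. The function name, the dictionary layout, and the toy molecule IDs are hypothetical and are not taken from the MolData repository.

```python
from math import sqrt


def assay_correlation(labels_a, labels_b):
    """Pearson correlation between two assays over molecules tested in both.

    labels_a / labels_b map a molecule ID to 1 (active) or 0 (inactive);
    a molecule missing from a dict was not screened in that assay.
    Returns None when fewer than two molecules overlap or when either
    assay has constant labels on the overlap (correlation undefined).
    """
    shared = sorted(set(labels_a) & set(labels_b))
    if len(shared) < 2:
        return None
    xs = [labels_a[m] for m in shared]
    ys = [labels_b[m] for m in shared]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: IDs and labels are invented for illustration.
assay_1 = {"mol_a": 1, "mol_b": 0, "mol_c": 1, "mol_d": 0}
assay_2 = {"mol_b": 0, "mol_c": 1, "mol_d": 0, "mol_e": 1}
print(assay_correlation(assay_1, assay_2))
```

Restricting the calculation to the shared molecules matters because screening data of this kind is sparse: most molecules are tested in only a few assays, so treating untested molecules as inactive would distort the result.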
Citation
Arshadi, Arash Keshavarzi; Salem, Milad; Firouzbakht, Arash; Yuan, Jiann Shiun. MolData, a molecular benchmark for disease and target based machine learning [J]. Journal of Cheminformatics, 2022, 14 (01)