Large-Scale Pretraining Improves Sample Efficiency of Active Learning-Based Virtual Screening

Cited by: 3
Authors
Cao, Zhonglin [1]
Sciabola, Simone [1]
Wang, Ye [1]
Affiliation
[1] Biogen, Med Chem, Cambridge, MA 02142 USA
Keywords
MOLECULAR DOCKING; INHIBITOR; DISCOVERY; BINDING; GENERATION; DATABASE; ZINC
DOI
10.1021/acs.jcim.3c01938
Chinese Library Classification number
R914 [Medicinal Chemistry]
Discipline classification code
100701
Abstract
Virtual screening of large compound libraries to identify potential hit candidates is one of the earliest steps in drug discovery. As commercially available compound collections grow exponentially to the scale of billions, active learning and Bayesian optimization have recently proven effective at narrowing down the search space. An essential component of those methods is a surrogate machine learning model that predicts the desired properties of compounds. An accurate model achieves high sample efficiency by finding hits after only a fraction of the entire library has been virtually screened. In this study, we examined the performance of a pretrained transformer-based language model and graph neural network in a Bayesian optimization active learning framework. The best pretrained model identifies 58.97% of the top-50,000 compounds after screening only 0.6% of an ultralarge library containing 99.5 million compounds, an 8% improvement over the previous state-of-the-art baseline. Through extensive benchmarks, we show that the superior performance of pretrained models persists in both structure-based and ligand-based drug discovery. Pretrained models can serve as a boost to the accuracy and sample efficiency of active learning-based virtual screening.
Pages: 1882-1891
Page count: 10
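
The active learning loop summarized in the abstract (train a surrogate on the compounds screened so far, use it to pick the next batch, score that batch, repeat) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' implementation: a closed-form ridge regression stands in for the pretrained transformer or graph neural network surrogate, a greedy rule (take the lowest predicted scores) stands in for the acquisition function, lower scores are treated as better, and the library and its scores are synthetic. The function names `fit_ridge`, `active_learning_screen`, and `top_k_recall` are hypothetical.

```python
import numpy as np


def fit_ridge(X, y, alpha=1.0):
    """Closed-form ridge regression; a stand-in for a pretrained surrogate."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)


def active_learning_screen(features, oracle, init_size, batch_size, n_rounds, seed=0):
    """Return {compound index: score} for every compound sent to the oracle."""
    rng = np.random.default_rng(seed)
    n = features.shape[0]
    # Round 0: score a random initial batch with the expensive oracle.
    initial = rng.choice(n, size=init_size, replace=False)
    screened = {int(i): float(oracle(int(i))) for i in initial}
    for _ in range(n_rounds):
        idx = np.fromiter(screened.keys(), dtype=int)
        y = np.fromiter(screened.values(), dtype=float)
        # Retrain the surrogate on everything screened so far.
        weights = fit_ridge(features[idx], y)
        predicted = features @ weights
        # Exclude already-screened compounds (assumption: lower score is better).
        predicted[idx] = np.inf
        # Greedy acquisition: take the compounds with the best predicted scores.
        batch = np.argsort(predicted)[:batch_size]
        for i in batch:
            screened[int(i)] = float(oracle(int(i)))
    return screened


def top_k_recall(screened, true_scores, k):
    """Fraction of the true top-k compounds found among the screened ones."""
    top_k = set(np.argsort(true_scores)[:k].tolist())
    return len(top_k & set(screened)) / k


if __name__ == "__main__":
    # Synthetic library: 2,000 "compounds" with 8 random features each.
    rng = np.random.default_rng(42)
    library = rng.normal(size=(2000, 8))
    true_scores = library @ rng.normal(size=8) + 0.1 * rng.normal(size=2000)
    screened = active_learning_screen(library, lambda i: true_scores[i], 50, 50, 3)
    print(f"screened {len(screened) / len(library):.1%} of the library")
    print(f"top-50 recall: {top_k_recall(screened, true_scores, 50):.2%}")
```

The recall computed by `top_k_recall` is the same kind of metric the abstract reports. For scale, the abstract's figures correspond to roughly 597,000 compounds screened (0.6% of 99.5 million) and about 29,485 of the top 50,000 recovered (58.97%).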