Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Cited by: 19
Authors
Wen, Mingjian [1]
Blau, Samuel M. [1]
Xie, Xiaowei [2,3]
Dwaraknath, Shyam [4]
Persson, Kristin A. [5,6]
Affiliations
[1] Lawrence Berkeley Natl Lab, Energy Technol Area, Berkeley, CA 94720 USA
[2] Univ Calif Berkeley, Coll Chem, Berkeley, CA 94720 USA
[3] Lawrence Berkeley Natl Lab, Mat Sci Div, Berkeley, CA 94720 USA
[4] Luxembourg Inst Sci & Technol, Luxembourg, Luxembourg
[5] Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA
[6] Lawrence Berkeley Natl Lab, Mol Foundry, Berkeley, CA 94720 USA
Keywords
PREDICTION; CLASSIFICATION; OUTCOMES; DESIGN
DOI
10.1039/d1sc06515g
Chinese Library Classification
O6 [Chemistry]
Discipline Code
0703
Abstract
Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on large amounts of labelled data to avoid overfitting, which otherwise leads to low accuracy and poor transferability. In this work, we propose a strategy that leverages unlabelled data to learn accurate ML models from small labelled chemical reaction datasets. We focus on a longstanding and prominent problem, classifying reactions into distinct families, and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from those of other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center, and we find that they are key to enabling the model to extract relevant information from unlabelled data to aid the reaction classification task. The transfer-learned model outperforms a supervised model trained from scratch by a large margin. Furthermore, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets, as well as models based on reaction fingerprints derived from masked language modelling. Beyond reaction classification, we test the effectiveness of the strategy on regression datasets; the learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to harness the power of unlabelled data for learning better models with a limited supply of labels.
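To make the pretraining objective concrete, below is a minimal PyTorch sketch of a SimCLR-style NT-Xent contrastive loss of the kind the abstract describes: embeddings of two augmented views of the same reaction are pulled together while all other reactions in the batch are pushed apart. The function name, temperature value, and embedding sizes are illustrative assumptions, not the authors' exact implementation.

    import torch
    import torch.nn.functional as F

    def nt_xent_loss(z1, z2, temperature=0.1):
        """SimCLR-style NT-Xent loss (a sketch, not the paper's exact code).

        z1, z2: (N, d) embeddings of two augmented views of the same N
        reactions. The positive pair for row i in z1 is row i in z2; every
        other embedding in the batch acts as a negative.
        """
        z1 = F.normalize(z1, dim=1)
        z2 = F.normalize(z2, dim=1)
        z = torch.cat([z1, z2], dim=0)        # (2N, d)
        sim = z @ z.t() / temperature         # cosine-similarity logits
        n = z1.shape[0]
        # exclude self-similarity on the diagonal
        sim.masked_fill_(torch.eye(2 * n, dtype=torch.bool), float("-inf"))
        # the positive for index i is its counterpart in the other view
        targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)])
        return F.cross_entropy(sim, targets)

    # toy usage: random embeddings standing in for GNN outputs on two
    # chemically consistent augmentations of a batch of 32 reactions
    z1, z2 = torch.randn(32, 128), torch.randn(32, 128)
    print(nt_xent_loss(z1, z2))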
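The abstract also mentions using the learned GNN-based fingerprints to query for similar reactions. A nearest-neighbour lookup by cosine similarity is one natural way to do this; the sketch below assumes that measure, and the names db_fps and query_fp are hypothetical.

    import torch
    import torch.nn.functional as F

    def most_similar(query, db, k=5):
        """Indices of the k database reactions whose fingerprints are most
        similar (by cosine similarity) to the query fingerprint."""
        sims = F.cosine_similarity(query.unsqueeze(0), db, dim=1)  # (M,)
        return sims.topk(k).indices

    db_fps = torch.randn(1000, 128)   # fingerprints of 1000 known reactions
    query_fp = torch.randn(128)       # fingerprint of a query reaction
    print(most_similar(query_fp, db_fps))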
Pages: 1446-1458
Page count: 13