Improving machine learning performance on small chemical reaction data with unsupervised contrastive pretraining

Cited by: 19

Authors:

Wen, Mingjian [1]

Blau, Samuel M. [1]

Xie, Xiaowei [2,3]

Dwaraknath, Shyam [4]

Persson, Kristin A. [5,6]

Affiliations:

[1] Lawrence Berkeley Natl Lab, Energy Technol Area, Berkeley, CA 94720 USA

[2] Univ Calif Berkeley, Coll Chem, Berkeley, CA 94720 USA

[3] Lawrence Berkeley Natl Lab, Mat Sci Div, Berkeley, CA 94720 USA

[4] Luxembourg Inst Sci & Technol, Luxembourg, Luxembourg

[5] Univ Calif Berkeley, Dept Mat Sci & Engn, Berkeley, CA 94720 USA

[6] Lawrence Berkeley Natl Lab, Mol Foundry, Berkeley, CA 94720 USA

Source:

CHEMICAL SCIENCE | 2022 / Vol. 13 / Issue 05

Keywords:

PREDICTION; CLASSIFICATION; OUTCOMES; DESIGN;

DOI:

10.1039/d1sc06515g

Chinese Library Classification number:

O6 [Chemistry];

Discipline classification code:

0703;

Abstract:

Machine learning (ML) methods have great potential to transform chemical discovery by accelerating the exploration of chemical space and drawing scientific insights from data. However, modern chemical reaction ML models, such as those based on graph neural networks (GNNs), must be trained on a large amount of labelled data to avoid overfitting, which leads to low accuracy and poor transferability. In this work, we propose a strategy to leverage unlabelled data to learn accurate ML models for small labelled chemical reaction data. We focus on an old and prominent problem, classifying reactions into distinct families, and build a GNN model for this task. We first pretrain the model on unlabelled reaction data using unsupervised contrastive learning and then fine-tune it on a small number of labelled reactions. The contrastive pretraining learns by making the representations of two augmented versions of a reaction similar to each other but distinct from those of other reactions. We propose chemically consistent reaction augmentation methods that protect the reaction center and find that they are the key for the model to extract relevant information from unlabelled data to aid the reaction classification task. The transfer-learned model outperforms a supervised model trained from scratch by a large margin. Further, it consistently performs better than models based on traditional rule-driven reaction fingerprints, which have long been the default choice for small datasets, as well as those based on reaction fingerprints derived from masked language modelling. In addition to reaction classification, the effectiveness of the strategy is tested on regression datasets; the learned GNN-based reaction fingerprints can also be used to navigate the chemical reaction space, which we demonstrate by querying for similar reactions. The strategy can be readily applied to other predictive reaction problems to uncover the power of unlabelled data for learning better models with a limited supply of labels.
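
The contrastive objective described in the abstract (two augmented views of the same reaction pulled together, other reactions pushed apart) can be illustrated with a SimCLR-style NT-Xent loss. This is a minimal sketch only: the abstract does not state which loss the authors use, and the function name `nt_xent_loss`, the `temperature` value, and the use of plain NumPy arrays in place of GNN outputs are all assumptions made for illustration.

```python
import numpy as np


def nt_xent_loss(z1, z2, temperature=0.1):
    """Illustrative NT-Xent loss (assumed, not taken from the paper).

    z1, z2: arrays of shape (N, D) holding embeddings of two augmented
    views of the same N reactions; row i of z1 pairs with row i of z2.
    """
    n = z1.shape[0]
    # Stack both views and L2-normalise so dot products are cosine similarities.
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = (z @ z.T) / temperature
    # An embedding must not be compared with itself.
    np.fill_diagonal(sim, -np.inf)
    # Index of the positive partner for each of the 2N rows.
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(0, n)])
    # Numerically stable log-sum-exp over all other embeddings.
    row_max = sim.max(axis=1, keepdims=True)
    log_denom = row_max[:, 0] + np.log(np.exp(sim - row_max).sum(axis=1))
    log_prob = sim[np.arange(2 * n), pos] - log_denom
    return float(-log_prob.mean())
```

The loss is small when each pair of views is closer to each other than to any other reaction in the batch, and large when the pairing is broken, which is the behaviour the pretraining step relies on.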

Pages: 1446-1458

Page count: 13
