Transfer learning for molecular property predictions from small datasets

Cited by: 0
Authors
Kirschbaum, Thorren [1]
Bande, Annika [1,2]
Affiliations
[1] Helmholtz Zentrum Berlin Mat & Energie GmbH, Theory Electron Dynam & Spect, Hahn Meitner Pl 1, D-14109 Berlin, Germany
[2] Leibniz Univ Hannover, Inst Inorgan Chem, Callinstr 9, D-30167 Hannover, Germany
Keywords
FREE-ENERGIES; FREESOLV;
DOI
10.1063/5.0214754
Chinese Library Classification
TB3 [Engineering Materials Science];
Discipline classification codes
0805; 080502
Abstract
Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated to a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels' distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO-LUMO gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset, but pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.
(c) 2024 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
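The transfer learning recipe the abstract describes (pre-train on a large set of cheap labels, standardize both label distributions to mean zero and standard deviation one, then fine-tune on the small accurate set) can be sketched as a toy example. This is an illustrative sketch only: a plain gradient-descent linear model stands in for PaiNN, and all data, weights, and noise levels are hypothetical, not taken from the paper.

```python
import numpy as np

def standardize(y):
    # Normalize labels to mean 0, std 1 -- the label-alignment step the abstract describes.
    mu, sigma = y.mean(), y.std()
    return (y - mu) / sigma, mu, sigma

def train(X, y, w=None, lr=0.01, epochs=500):
    # Gradient-descent linear regressor (toy stand-in for a neural network).
    # w=None trains from scratch; passing pre-trained weights warm-starts (fine-tunes).
    n, d = X.shape
    if w is None:
        w = np.zeros(d)
    for _ in range(epochs):
        grad = X.T @ (X @ w - y) / n
        w -= lr * grad
    return w

rng = np.random.default_rng(0)
w_true = np.array([1.5, -2.0, 0.7])  # hypothetical ground-truth relationship

# Large pre-training set with "cheap" (noisier) labels, standing in for
# semi-empirical or low-level ab initio labels.
X_pre = rng.normal(size=(2000, 3))
y_pre, _, _ = standardize(X_pre @ w_true + rng.normal(scale=0.5, size=2000))

# Small fine-tuning set with accurate labels, standing in for the original dataset.
X_ft = rng.normal(size=(30, 3))
y_ft, _, _ = standardize(X_ft @ w_true + rng.normal(scale=0.05, size=30))

w_pre = train(X_pre, y_pre)                            # pre-train on cheap labels
w_tl = train(X_ft, y_ft, w=w_pre.copy(), epochs=100)   # fine-tune (warm start)
w_scratch = train(X_ft, y_ft, epochs=100)              # small-data baseline
```

Because both label sets are standardized before training, the pre-trained weights land near the optimum of the fine-tuning task, so the warm-started model typically beats the from-scratch baseline given the same small-data budget.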
Pages: 9
Related papers
50 records in total
  • [31] Internal Transfer Learning for Improving Performance in Human Action Recognition for Small Datasets
    Wang, Tian
    Chen, Yang
    Zhang, Mengyi
    Chen, Jie
    Snoussi, Hichem
    IEEE ACCESS, 2017, 5 : 17627 - 17633
  • [32] Defect detection of injection molding products on small datasets using transfer learning
    Liu, Jiahuan
    Guo, Fei
    Gao, Huang
    Li, Maoyuan
    Zhang, Yun
    Zhou, Huamin
    JOURNAL OF MANUFACTURING PROCESSES, 2021, 70 : 400 - 413
  • [33] Active vs Transfer Learning Approaches for QoT Estimation with Small Training Datasets
    Azzimonti, Dario
    Rottondi, Cristina
    Giusti, Alessandro
    Tornatore, Massimo
    Bianco, Andrea
    2020 OPTICAL FIBER COMMUNICATIONS CONFERENCE AND EXPOSITION (OFC), 2020,
  • [34] Neural-network-based transfer learning for predicting cryo-CMOS characteristics from small datasets
    Inaba, Takumi
    Chiashi, Yusuke
    Ogura, Minoru
    Asai, Hidehiro
    Fuketa, Hiroshi
    Oka, Hiroshi
    Iizuka, Shota
    Kato, Kimihiko
    Shitakata, Shunsuke
    Mori, Takahiro
    APPLIED PHYSICS EXPRESS, 2024, 17 (07)
  • [35] Scale-Space Data Augmentation for Deep Transfer Learning of Crack Damage from Small Sized Datasets
    Tang, Shimin
    Chen, ZhiQiang
    JOURNAL OF NONDESTRUCTIVE EVALUATION, 2020, 39 (03)
  • [36] Assessing static glass leaching predictions from large datasets using machine learning
    Lillington, Joseph N. P.
    Gout, Thomas L.
    Harrison, Mike T.
    Farnan, Ian
    JOURNAL OF NON-CRYSTALLINE SOLIDS, 2020, 546
  • [38] Advances in machine learning with chemical language models in molecular property and reaction outcome predictions
    Das, Manajit
    Ghosh, Ankit
    Sunoj, Raghavan B.
    JOURNAL OF COMPUTATIONAL CHEMISTRY, 2024, 45 (14) : 1160 - 1176
  • [39] Geometric deep learning for molecular property predictions with chemical accuracy across chemical space
    Dobbelaere, Maarten R.
    Lengyel, Istvan
    Stevens, Christian V.
    Van Geem, Kevin M.
    JOURNAL OF CHEMINFORMATICS, 2024, 16 (01)
  • [40] Leveraging Transfer Learning for Predicting Protein-Small-Molecule Interaction Predictions
    Wang, Jian
    Dokholyan, Nikolay V.
    JOURNAL OF CHEMICAL INFORMATION AND MODELING, 2025,