Transfer learning for molecular property predictions from small datasets

Cited by: 0
Authors
Kirschbaum, Thorren [1]
Bande, Annika [1,2]
Affiliations
[1] Helmholtz Zentrum Berlin Mat & Energie GmbH, Theory Electron Dynam & Spect, Hahn Meitner Pl 1, D-14109 Berlin, Germany
[2] Leibniz Univ Hannover, Inst Inorgan Chem, Callinstr 9, D-30167 Hannover, Germany
Keywords
FREE-ENERGIES; FREESOLV;
DOI
10.1063/5.0214754
Chinese Library Classification
TB3 [Engineering Materials Science];
Discipline codes
0805; 080502;
Abstract
Machine learning has emerged as a new tool in chemistry to bypass expensive experiments or quantum-chemical calculations, for example, in high-throughput screening applications. However, many machine learning studies rely on small datasets, making it difficult to efficiently implement powerful deep learning architectures such as message passing neural networks. In this study, we benchmark common machine learning models for the prediction of molecular properties on two small datasets, for which the best results are obtained with the message passing neural network PaiNN as well as SOAP molecular descriptors concatenated with a set of simple molecular descriptors tailored to gradient boosting with regression trees. To further improve the predictive capabilities of PaiNN, we present a transfer learning strategy that uses large datasets to pre-train the respective models and allows us to obtain more accurate models after fine-tuning on the original datasets. The pre-training labels are obtained from computationally cheap ab initio or semi-empirical models, and both datasets are normalized to mean zero and standard deviation one to align the labels' distributions. This study covers two small chemistry datasets, the Harvard Organic Photovoltaics dataset (HOPV, HOMO-LUMO gaps), for which excellent results are obtained, and the FreeSolv dataset (solvation energies), where this method is less successful, probably due to a complex underlying learning task and the dissimilar methods used to obtain pre-training and fine-tuning labels. Finally, we find that for the HOPV dataset, the final training results do not improve monotonically with the size of the pre-training dataset; instead, pre-training with fewer data points can lead to more biased pre-trained models and higher accuracy after fine-tuning.
(c) 2024 Author(s). All article content, except where otherwise noted, is licensed under a Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
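The label-alignment step described in the abstract (normalizing both the pre-training and fine-tuning labels to mean zero and standard deviation one so that their distributions match) can be sketched as follows. This is a minimal illustration, not the authors' code; the toy label arrays are hypothetical stand-ins for cheap semi-empirical pre-training labels and accurate fine-tuning labels.

```python
import numpy as np

def z_normalize(labels):
    """Standardize labels to mean 0 and std 1; return the stats
    needed to map predictions back to the original scale."""
    mean, std = labels.mean(), labels.std()
    return (labels - mean) / std, mean, std

# Hypothetical label sets with different offsets and scales,
# e.g. semi-empirical gaps (pre-training) vs. reference gaps (fine-tuning).
pretrain_labels = np.array([2.1, 2.5, 1.9, 2.8, 2.3])
finetune_labels = np.array([1.2, 1.6, 1.0, 1.9, 1.4])

pre_norm, _, _ = z_normalize(pretrain_labels)
fine_norm, mu, sigma = z_normalize(finetune_labels)

# Both normalized arrays now have mean ~0 and std ~1, so the model
# pre-trained on one can be fine-tuned on the other without a label
# distribution shift; predictions are un-normalized via y = y_norm * sigma + mu.
```

In practice the normalization statistics would be computed on the training split only and reused for validation and test labels.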
Pages: 9
Related papers
50 records
  • [41] A hybrid approach based on transfer and ensemble learning for improving performances of deep learning models on small datasets
    Gultekin, Tunc
    Ugur, Aybars
    TURKISH JOURNAL OF ELECTRICAL ENGINEERING AND COMPUTER SCIENCES, 2021, 29 (07) : 3197 - 3211
  • [42] SURFNet: Super-resolution of Turbulent Flows with Transfer Learning using Small Datasets
    Obiols-Sales, Octavi
    Vishnu, Abhinav
    Malaya, Nicholas P.
    Chandramowlishwaran, Aparna
    30TH INTERNATIONAL CONFERENCE ON PARALLEL ARCHITECTURES AND COMPILATION TECHNIQUES (PACT 2021), 2021, : 331 - 344
  • [43] Transfer learning-based fault location with small datasets in VSC-HVDC
    Shang, Boyang
    Luo, Guomin
    Li, Meng
    Liu, Yinglin
    Hei, Jiaxin
    INTERNATIONAL JOURNAL OF ELECTRICAL POWER & ENERGY SYSTEMS, 2023, 151
  • [44] Improving random forest predictions in small datasets from two-phase sampling designs
    Han, Sunwoo
    Williamson, Brian D.
    Fong, Youyi
    BMC MEDICAL INFORMATICS AND DECISION MAKING, 2021, 21 (01)
  • [45] Improving random forest predictions in small datasets from two-phase sampling designs
    Sunwoo Han
    Brian D. Williamson
    Youyi Fong
    BMC Medical Informatics and Decision Making, 21
  • [46] Using machine learning to predict concrete's strength: learning from small datasets
    Ouyang, Boya
    Song, Yu
    Li, Yuhai
    Wu, Feishu
    Yu, Huizi
    Wang, Yongzhe
    Yin, Zhanyuan
    Luo, Xiaoshu
    Sant, Gaurav
    Bauchy, Mathieu
ENGINEERING RESEARCH EXPRESS, 2021, 3 (01)
  • [47] BN parameter learning from small datasets based on uncertain priors
    Mei, Jun-Feng
    Gao, Xiao-Guang
    Wan, Kai-Fang
Xi Tong Gong Cheng Yu Dian Zi Ji Shu/Systems Engineering and Electronics, 2014, 36 (06): 1207 - 1214
  • [48] Accurate property prediction with interpretable machine learning model for small datasets via transformed atom vector
    Chen, Xinyu
    Lu, Shuaihua
    Wan, Xinyang
    Chen, Qian
    Zhou, Qionghua
    Wang, Jinlan
    PHYSICAL REVIEW MATERIALS, 2022, 6 (12)
  • [49] Transfer Learning and Fine-Tuning for Deep Learning-Based Tea Diseases Detection on Small Datasets
    Ramdan, Ade
    Heryana, Ana
    Arisal, Andria
    Kusumo, R. Budiarianto S.
    Pardede, Hilman F.
    2020 INTERNATIONAL CONFERENCE ON RADAR, ANTENNA, MICROWAVE, ELECTRONICS, AND TELECOMMUNICATIONS (ICRAMET): FOSTERING INNOVATION THROUGH ICTS FOR SUSTAINABLE SMART SOCIETY, 2020, : 206 - 211
  • [50] A cross-region transfer learning method for classification of community service cases with small datasets
    Liu, Zhao-ge
    Li, Xiang-yang
    Qiao, Li-min
    Durrani, Dilawar Khan
    KNOWLEDGE-BASED SYSTEMS, 2020, 193