BARTSmiles: Generative Masked Language Models for Molecular Representations

Cited by: 0
Authors:
Chilingaryan, Gayane [1]
Tamoyan, Hovhannes [1]
Tevosyan, Ani [1,3]
Babayan, Nelly [2,3]
Hambardzumyan, Karen [1]
Navoyan, Zaven [3]
Aghajanyan, Armen [4]
Khachatrian, Hrant [1,5]
Khondkaryan, Lusine [2,3]
Affiliations:
[1] YerevaNN, Yerevan 0025, Armenia
[2] NAS RA, Inst Mol Biol, Yerevan 0014, Armenia
[3] Toxometris Ai, Yerevan 0019, Armenia
[4] Meta AI Res, Menlo Pk, CA 94025 USA
[5] Yerevan State Univ, Yerevan 0025, Armenia
Keywords:
STRUCTURAL ALERTS; PREDICTION; CHEMISTRY
DOI:
10.1021/acs.jcim.4c00512
Chinese Library Classification (CLC):
R914 [Medicinal Chemistry]
Discipline code:
100701
Abstract:
We discover a robust self-supervised strategy tailored toward molecular representations for generative masked language models through a series of in-depth ablations. Using this pretraining strategy, we train BARTSmiles, a BART-like model trained with an order of magnitude more compute than previous self-supervised molecular representations. In-depth evaluations show that BARTSmiles consistently outperforms other self-supervised representations across classification, regression, and generation tasks, setting a new state of the art on eight tasks. We then show that, when applied to the molecular domain, the BART objective learns representations that implicitly encode our downstream tasks of interest. For example, by selecting just seven neurons from a frozen BARTSmiles, we can obtain a model whose performance on the ClinTox task is within two percentage points of the fully fine-tuned model. Lastly, we show that standard attribution interpretability methods, when applied to BARTSmiles, highlight substructures that chemists use to explain specific properties of molecules. The code and pretrained model are publicly available.
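The frozen-probe result in the abstract is easy to picture in code. The sketch below is a minimal illustration, assuming a HuggingFace-transformers-compatible BART checkpoint: it mean-pools the frozen encoder's hidden states over each SMILES string, keeps seven selected neurons, and fits a logistic-regression probe on them. The checkpoint path, neuron indices, SMILES strings, and labels are all illustrative placeholders, not values from the paper.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import AutoModel, AutoTokenizer

# Placeholder path: substitute the released BARTSmiles weights here.
CHECKPOINT = "path/to/bartsmiles"
# Seven illustrative neuron indices; the paper selects neurons by a
# dedicated procedure, these values are assumptions for the sketch.
NEURON_IDX = [3, 17, 42, 101, 256, 512, 900]

tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
model = AutoModel.from_pretrained(CHECKPOINT).eval()  # frozen: no fine-tuning

@torch.no_grad()
def featurize(smiles_batch):
    """Mean-pool encoder hidden states over tokens, keep selected neurons."""
    enc = tokenizer(smiles_batch, return_tensors="pt", padding=True)
    hidden = model.get_encoder()(**enc).last_hidden_state  # (batch, tokens, dim)
    mask = enc["attention_mask"].unsqueeze(-1)             # zero out padding
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)  # (batch, dim)
    return pooled[:, NEURON_IDX].numpy()                   # (batch, 7)

# Toy SMILES and labels standing in for ClinTox annotations.
X = featurize(["CCO", "c1ccccc1", "CC(=O)Oc1ccccc1C(=O)O", "CCN(CC)CC"])
y = [0, 1, 0, 1]
probe = LogisticRegression().fit(X, y)  # the 7-neuron linear probe
print(probe.predict(featurize(["CCO"])))
```

The point of the sketch is only the shape of the pipeline: frozen features in, a tiny linear head out. That such a low-dimensional probe comes within two percentage points of full fine-tuning is what the abstract means by the pretraining objective implicitly encoding the downstream task.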
Pages: 5832-5843
Page count: 12