Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning

Cited by: 3
Authors
Yu, Youngjae [3 ]
Chung, Jiwan [3 ]
Yun, Heeseung [3 ]
Hessel, Jack [1 ]
Park, Jae Sung [1 ,5 ]
Lu, Ximing [1 ,5 ]
Zellers, Rowan [1 ,2 ]
Ammanabrolu, Prithviraj [1 ]
Le Bras, Ronan [1 ]
Kim, Gunhee [1 ,4 ]
Choi, Yejin [1 ,4 ]
Affiliations
[1] Allen Inst Artificial Intelligence, Seattle, WA USA
[2] OpenAI, Seattle, WA USA
[3] Yonsei Univ, Dept Artificial Intelligence, Seoul, South Korea
[4] Seoul Natl Univ, Dept Comp Sci & Engn, Seoul, South Korea
[5] Univ Washington, Paul G Allen Sch Comp Sci, Seattle, WA 98195 USA
Keywords
DOI
10.1109/CVPR52729.2023.01044
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104; 0812; 0835; 1405;
Abstract
Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 [7] manifest broad commonsense reasoning capacity. Can their knowledge be extended to multimodal inputs such as images and audio without paired domain data? In this work, we propose ESPER (Extending Sensory PErception with Reinforcement learning), which enables text-only pretrained models to address multimodal tasks such as visual commonsense reasoning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, our reward optimization relies only on cosine similarity derived from CLIP [52] and requires no additional paired (image, text) data. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of multimodal text generation tasks ranging from captioning to commonsense reasoning; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating text in several different domains for each image. Our code and data are publicly released at https://github.com/JiwanChung/esper.
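The reward signal described in the abstract can be illustrated with a short sketch. The following is a minimal, hypothetical example of a CLIP cosine-similarity reward built on the Hugging Face transformers CLIP implementation (CLIPModel, CLIPProcessor) with the openai/clip-vit-base-patch32 checkpoint; the checkpoint choice and the function name clip_similarity_reward are assumptions made for illustration, and ESPER's released code may compute the reward differently.

# Minimal sketch (assumption: Hugging Face CLIP, not ESPER's exact implementation)
# of the reward described in the abstract: the reward for a generated caption is
# the cosine similarity between CLIP's image and text embeddings, so no paired
# (image, text) supervision is required.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_similarity_reward(image, generated_text: str) -> float:
    """Return the cosine similarity between an image and a generated text."""
    inputs = processor(text=[generated_text], images=image,
                       return_tensors="pt", padding=True)
    with torch.no_grad():
        image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Normalize both embeddings so their dot product equals cosine similarity.
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1).item()

In a policy-gradient loop (the abstract specifies reinforcement learning; a PPO-style setup is one plausible instantiation), this scalar would serve as the reward for each sampled generation, pushing the text-only language model's outputs toward agreement with the visual input.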
Pages: 10845-10856
Number of pages: 12
Related Papers (50 records in total)
  • [31] Zero-Shot Recommendations with Pre-Trained Large Language Models for Multimodal Nudging
    Harrison, Rachel M.
    Dereventsov, Anton
    Bibin, Anton
    2023 23RD IEEE INTERNATIONAL CONFERENCE ON DATA MINING WORKSHOPS, ICDMW 2023, 2023, : 1535 - 1542
  • [32] UniRaG: Unification, Retrieval, and Generation for Multimodal Question Answering With Pre-Trained Language Models
    Lim, Qi Zhi
    Lee, Chin Poo
    Lim, Kian Ming
    Samingan, Ahmad Kamsani
    IEEE ACCESS, 2024, 12 : 71505 - 71519
  • [33] A Study of Pre-trained Language Models in Natural Language Processing
    Duan, Jiajia
    Zhao, Hui
    Zhou, Qian
    Qiu, Meikang
    Liu, Meiqin
    2020 IEEE INTERNATIONAL CONFERENCE ON SMART CLOUD (SMARTCLOUD 2020), 2020, : 116 - 121
  • [34] Learning to Modulate pre-trained Models in RL
    Schmied, Thomas
    Hofmarcher, Markus
    Paischer, Fabian
    Pascanu, Razvan
    Hochreiter, Sepp
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 36 (NEURIPS 2023), 2023,
  • [35] Continual Learning with Pre-Trained Models: A Survey
    Zhou, Da-Wei
    Sun, Hai-Long
    Ning, Jingyi
    Ye, Han-Jia
    Zhan, De-Chuan
    PROCEEDINGS OF THE THIRTY-THIRD INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, IJCAI 2024, 2024, : 8363 - 8371
  • [36] Clinical efficacy of pre-trained large language models through the lens of aphasia
    Cong, Yan
    Lacroix, Arianna N.
    Lee, Jiyeon
    SCIENTIFIC REPORTS, 2024, 14 (01):
  • [37] Discrimination Bias Detection Through Categorical Association in Pre-Trained Language Models
    Dusi, Michele
    Arici, Nicola
    Gerevini, Alfonso Emilio
    Putelli, Luca
    Serina, Ivan
    IEEE ACCESS, 2024, 12 : 162651 - 162667
  • [38] Kurdish Sign Language Recognition Using Pre-Trained Deep Learning Models
    Alsaud, Ali A.
    Yousif, Raghad Z.
    Aziz, Marwan. M.
    Kareem, Shahab W.
    Maho, Amer J.
    JOURNAL OF ELECTRICAL SYSTEMS, 2024, 20 (06) : 1334 - 1344
  • [39] Efficient Data Learning for Open Information Extraction with Pre-trained Language Models
    Fan, Zhiyuan
    He, Shizhu
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 13056 - 13063
  • [40] Integration of pre-trained protein language models into geometric deep learning networks
    Fang Wu
    Lirong Wu
    Dragomir Radev
    Jinbo Xu
    Stan Z. Li
    Communications Biology, 6