Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning

Cited by: 3
Authors
Yu, Youngjae [3]
Chung, Jiwan [3]
Yun, Heeseung [3]
Hessel, Jack [1]
Park, Jae Sung [1,5]
Lu, Ximing [1,5]
Zellers, Rowan [1,2]
Ammanabrolu, Prithviraj [1]
Le Bras, Ronan [1]
Kim, Gunhee [1,4]
Choi, Yejin [1,4]
Affiliations
[1] Allen Institute for Artificial Intelligence, Seattle, WA, USA
[2] OpenAI, Seattle, WA, USA
[3] Yonsei University, Department of Artificial Intelligence, Seoul, South Korea
[4] Seoul National University, Department of Computer Science and Engineering, Seoul, South Korea
[5] University of Washington, Paul G. Allen School of Computer Science & Engineering, Seattle, WA 98195, USA
Keywords
DOI
10.1109/CVPR52729.2023.01044
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405;
Abstract
Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), and larger models like GPT-3 [7] manifest broad commonsense reasoning capacity. Can their knowledge be extended to multimodal inputs such as images and audio without paired domain data? In this work, we propose ESPER (Extending Sensory PErception with Reinforcement learning), which enables text-only pretrained models to address multimodal tasks such as visual commonsense reasoning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, our reward optimization relies only on cosine similarity derived from CLIP [52] and requires no additional paired (image, text) data. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of multimodal text generation tasks ranging from captioning to commonsense reasoning; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating text in several different domains for each image. Our code and data are publicly released at https://github.com/JiwanChung/esper.
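To make the reward described in the abstract concrete, the sketch below shows a CLIP cosine-similarity score of the kind ESPER optimizes with reinforcement learning. This is not the authors' released implementation (see the linked repository); the checkpoint name, the clip_reward helper, and all details here are assumptions for illustration, using the Hugging Face transformers CLIP API.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative sketch only: a frozen CLIP model scores how well a generated
# caption matches an image; the scalar serves as the RL reward.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").to(device).eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

@torch.no_grad()
def clip_reward(image: Image.Image, generated_text: str) -> float:
    """Cosine similarity between CLIP image and text embeddings."""
    inputs = processor(text=[generated_text], images=image,
                       return_tensors="pt", padding=True, truncation=True).to(device)
    image_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                       attention_mask=inputs["attention_mask"])
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    return (image_emb * text_emb).sum(dim=-1).item()

# Example usage (hypothetical inputs):
# reward = clip_reward(Image.open("photo.jpg"), "a dog running on the beach")
```

Because CLIP stays frozen and only supplies a scalar score, no paired (image, text) supervision is required; in a PPO-style setup this scalar would be used as the return for the language model's sampled generation.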
Pages: 10845 - 10856
Number of pages: 12
Related Papers
50 records in total
  • [1] Exploring Lottery Prompts for Pre-trained Language Models
    Chen, Yulin
    Ding, Ning
    Wang, Xiaobin
    Hu, Shengding
    Zheng, Hai-Tao
    Liu, Zhiyuan
    Xie, Pengjun
    PROCEEDINGS OF THE 61ST ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023): LONG PAPERS, VOL 1, 2023, : 15428 - 15444
  • [2] Vulnerability Analysis of Continuous Prompts for Pre-trained Language Models
    Li, Zhicheng
    Shi, Yundi
    Sheng, Xuan
    Yin, Changchun
    Zhou, Lu
    Li, Piji
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING, ICANN 2023, PT IX, 2023, 14262 : 508 - 519
  • [3] A Study on Accessing Linguistic Information in Pre-Trained Language Models by Using Prompts
    Di Marco, Marion
    Haemmerl, Katharina
    Fraser, Alexander
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 7328 - 7336
  • [4] Pre-trained multimodal end-to-end network for spoken language assessment incorporating prompts
    Lin, Binghuai
    Wang, Liyuan
    PROCEEDINGS OF 2022 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2022, : 1394 - 1398
  • [5] Pre-Trained Language Models and Their Applications
    Wang, Haifeng
    Li, Jiwei
    Wu, Hua
    Hovy, Eduard
    Sun, Yu
    ENGINEERING, 2023, 25 : 51 - 65
  • [6] Meta Distant Transfer Learning for Pre-trained Language Models
    Wang, Chengyu
    Pan, Haojie
    Qiu, Minghui
    Yang, Fei
    Huang, Jun
    Zhang, Yin
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 9742 - 9752
  • [7] Simple and Effective Multimodal Learning Based on Pre-Trained Transformer Models
    Miyazawa, Kazuki
    Kyuragi, Yuta
    Nagai, Takayuki
    IEEE ACCESS, 2022, 10 : 29821 - 29833
  • [8] Multimodal Search on Iconclass using Vision-Language Pre-Trained Models
    Santini, Cristian
    Posthumus, Etienne
    Tietz, Tabea
    Tan, Mary Ann
    Bruns, Oleksandra
    Sack, Harald
    2023 ACM/IEEE JOINT CONFERENCE ON DIGITAL LIBRARIES, JCDL, 2023, : 285 - 287
  • [9] Vision Guided Generative Pre-trained Language Models for Multimodal Abstractive Summarization
    Yu, Tiezheng
    Dai, Wenliang
    Liu, Zihan
    Fung, Pascale
    2021 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING (EMNLP 2021), 2021, : 3995 - 4007
  • [10] The BLA Benchmark: Investigating Basic Language Abilities of Pre-Trained Multimodal Models
    Chen, Xinyi
    Fernandez, Raquel
    Pezzelle, Sandro
    2023 CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING, EMNLP 2023, 2023, : 5817 - 5830