Fusing Pre-trained Language Models with Multimodal Prompts through Reinforcement Learning

Cited by: 3
Authors
Yu, Youngjae [3]
Chung, Jiwan [3]
Yun, Heeseung [3]
Hessel, Jack [1]
Park, Jae Sung [1,5]
Lu, Ximing [1,5]
Zellers, Rowan [1,2]
Ammanabrolu, Prithviraj [1]
Le Bras, Ronan [1]
Kim, Gunhee [1,4]
Choi, Yejin [1,4]
Affiliations
[1] Allen Inst Artificial Intelligence, Seattle, WA USA
[2] OpenAI, Seattle, WA USA
[3] Yonsei Univ, Dept Artificial Intelligence, Seoul, South Korea
[4] Seoul Natl Univ, Dept Comp Sci & Engn, Seoul, South Korea
[5] Univ Washington, Paul G Allen Sch Comp Sci, Seattle, WA 98195 USA
DOI
10.1109/CVPR52729.2023.01044
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Language models are capable of commonsense reasoning: domain-specific models can learn from explicit knowledge (e.g., commonsense graphs [6], ethical norms [25]), while larger models like GPT-3 [7] manifest broad commonsense reasoning capacity. Can their knowledge be extended to multimodal inputs such as images and audio without paired domain data? In this work, we propose ESPER (Extending Sensory PErception with Reinforcement learning), which enables text-only pretrained models to address multimodal tasks such as visual commonsense reasoning. Our key novelty is to use reinforcement learning to align multimodal inputs to language model generations without direct supervision: for example, our reward optimization relies only on cosine similarity derived from CLIP [52] and requires no additional paired (image, text) data. Experiments demonstrate that ESPER outperforms baselines and prior work on a variety of multimodal text generation tasks ranging from captioning to commonsense reasoning; these include a new benchmark we collect and release, the ESP dataset, which tasks models with generating text in several different domains for each image. Our code and data are publicly released at https://github.com/JiwanChung/esper.
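The abstract says the reward relies only on cosine similarity derived from CLIP. As a rough illustration of that single idea, the sketch below scores a candidate text embedding against an image embedding. It is not the authors' implementation: the function names `cosine_similarity` and `similarity_reward` are hypothetical, the embeddings are plain lists of floats standing in for CLIP outputs, and the record does not describe any scaling, baselines, or additional reward terms that the actual method may use.

```python
import math
from typing import Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length, non-zero vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def similarity_reward(image_embedding: Sequence[float],
                      text_embedding: Sequence[float]) -> float:
    """Hypothetical reward: higher when the generated text's embedding
    points in the same direction as the image's embedding.

    Assumption: both embeddings come from the same joint embedding space
    (for example, the image and text encoders of a CLIP-style model).
    """
    return cosine_similarity(image_embedding, text_embedding)


if __name__ == "__main__":
    # Toy vectors only; real embeddings would come from an encoder.
    image = [1.0, 0.0]
    aligned_text = [2.0, 0.0]    # same direction as the image vector
    unrelated_text = [0.0, 3.0]  # orthogonal to the image vector
    print(similarity_reward(image, aligned_text))
    print(similarity_reward(image, unrelated_text))
```

Because the score depends only on the two embeddings, a reward of this shape needs no paired (image, text) training examples, which is the property the abstract emphasizes.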
Pages: 10845-10856
Number of pages: 12