Unifying Vision-and-Language Tasks via Text Generation

Authors
Cho, Jaemin [1]
Lei, Jie [1]
Tan, Hao [1]
Bansal, Mohit [1]
Affiliations
[1] Univ N Carolina, Chapel Hill, NC 27599 USA
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning, etc. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. Also, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving similar performance to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
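
The idea in the abstract, that labels for different tasks are all written out as text so one generator can learn them, can be illustrated with a minimal sketch. It only shows how examples from three tasks might be cast into a shared (input text, visual input, target text) form. The task prefixes, the `<vis_N>` region token, and every class and function name here are illustrative assumptions, not details stated in the abstract or the authors' actual code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TextGenExample:
    """One example in a shared text-generation format (hypothetical)."""
    source_text: str                     # task prefix plus textual input
    region_features: List[List[float]]   # visual input, one vector per region
    target_text: str                     # the label, written out as text


def make_vqa_example(question: str, answer: str,
                     regions: List[List[float]]) -> TextGenExample:
    # The answer is generated as text instead of picked by a classifier.
    return TextGenExample(f"vqa: {question}", regions, answer)


def make_grounding_example(phrase: str, region_index: int,
                           regions: List[List[float]]) -> TextGenExample:
    # The referred region is named by a text token instead of being scored.
    if not 0 <= region_index < len(regions):
        raise ValueError("region_index is out of range")
    return TextGenExample(f"visual grounding: {phrase}", regions,
                          f"<vis_{region_index}>")


def make_caption_example(caption: str,
                         regions: List[List[float]]) -> TextGenExample:
    # Captioning already has a text target; only the prefix is added.
    return TextGenExample("caption:", regions, caption)


# Toy visual input: two regions with two-dimensional features.
regions = [[0.0, 1.0], [1.0, 0.0]]
mixed_batch = [
    make_vqa_example("what is the man doing?", "surfing", regions),
    make_grounding_example("the yellow surfboard", 1, regions),
    make_caption_example("a man riding a wave", regions),
]
```

Because every example has the same shape, a mixed batch like `mixed_batch` could in principle be fed to a single model trained with one language modeling loss, which is the multi-task setting the abstract describes.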
Pages: 12