Unifying Vision-and-Language Tasks via Text Generation

Cited by: 0
Authors
Cho, Jaemin [1]
Lei, Jie [1]
Tan, Hao [1]
Bansal, Mohit [1]
Affiliations
[1] Univ N Carolina, Chapel Hill, NC 27599 USA
DOI
Not available
Chinese Library Classification
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
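To make the idea of "generating labels in text" concrete, the following is a minimal sketch of how different tasks could be cast as text-to-text pairs. It is an illustration under stated assumptions, not the authors' implementation: the task prefixes, the `Example` class, the `to_text_pair` helper, and the `<region_3>` token are all hypothetical, and the visual inputs are omitted.

```python
from dataclasses import dataclass


@dataclass
class Example:
    source_text: str  # text that would accompany the visual input
    target_text: str  # label expressed as text for the model to generate


# Hypothetical task prefixes, chosen for illustration only.
TASK_PREFIXES = {
    "vqa": "vqa:",
    "refexp": "visual grounding:",
    "caption": "caption:",
}


def to_text_pair(task: str, text_input: str, label: str) -> Example:
    """Cast one task instance as a (source text, target text) pair."""
    prefix = TASK_PREFIXES[task]
    source = f"{prefix} {text_input}".strip()
    return Example(source, label)


examples = [
    # A classification-style answer becomes a short target string.
    to_text_pair("vqa", "what is the man holding?", "umbrella"),
    # A region choice becomes a (hypothetical) region token.
    to_text_pair("refexp", "the dog on the left", "<region_3>"),
    # Captioning has no text input beyond the prefix.
    to_text_pair("caption", "", "a man walking a dog in the rain"),
]

for ex in examples:
    print(f"{ex.source_text!r} -> {ex.target_text!r}")
```

Because every task shares the same pair format, a single sequence-to-sequence model with one language modeling objective could in principle be trained on a mixture of such pairs, which is the multi-task setting the abstract describes.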
Pages: 12