Unifying Vision-and-Language Tasks via Text Generation

被引:0
|
作者
Cho, Jaemin [1 ]
Lei, Jie [1 ]
Tan, Hao [1 ]
Bansal, Mohit [1 ]
机构
[1] Univ N Carolina, Chapel Hill, NC 27599 USA
关键词
D O I
暂无
中图分类号
TP18 [人工智能理论];
学科分类号
081104 ; 0812 ; 0835 ; 1405 ;
摘要
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task. For example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning, etc. To alleviate these hassles, in this work, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, visual commonsense reasoning, most of which have been previously modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches comparable performance to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization ability on questions that have rare answers. Also, we show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving similar performance to separately optimized single-task models. Our code is publicly available at: https://github.com/j -min/VL-T5
引用
收藏
页数:12
相关论文
共 50 条
  • [11] SkyEyeGPT: Unifying remote sensing vision-language tasks via instruction tuning with large language model
    Zhan, Yang
    Xiong, Zhitong
    Yuan, Yuan
    ISPRS JOURNAL OF PHOTOGRAMMETRY AND REMOTE SENSING, 2025, 221 : 64 - 77
  • [12] HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks
    Zhang, Zhengkun
    Guo, Wenya
    Meng, Xiaojun
    Wang, Yasheng
    Wang, Yadao
    Jiang, Xin
    Liu, Qun
    Yang, Zhenglu
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (ACL 2023), 2023, : 11442 - 11453
  • [13] ICU: Conquering Language Barriers in Vision-and-Language Modeling by Dividing the Tasks into Image Captioning and Language Understanding
    Wu, Guojun
    FINDINGS OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS (EMNLP 2023), 2023, : 14740 - 14746
  • [14] Vision-and-Language Navigation via Latent Semantic Alignment Learning
    Wu, Siying
    Fu, Xueyang
    Wu, Feng
    Zha, Zheng-Jun
    IEEE TRANSACTIONS ON MULTIMEDIA, 2024, 26 : 8406 - 8418
  • [15] Iterative Vision-and-Language Navigation
    Krantz, Jacob
    Banerjee, Shurjo
    Zhu, Wang
    Corso, Jason
    Anderson, Peter
    Lee, Stefan
    Thomason, Jesse
    2023 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2023, : 14921 - 14930
  • [16] ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks
    Lu, Jiasen
    Batra, Dhruv
    Parikh, Devi
    Lee, Stefan
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [17] Uni-NLX: Unifying Textual Explanations for Vision and Vision-Language Tasks
    Sammani, Fawaz
    Deligiannis, Nikos
    2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023, : 4636 - 4641
  • [18] VL-ADAPTER: Parameter-Efficient Transfer Learning for Vision-and-Language Tasks
    Sung, Yi-Lin
    Cho, Jaemin
    Bansal, Mohit
    2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 5217 - 5227
  • [19] PanoGen plus plus : Domain-adapted text-guided panoramic environment generation for vision-and-language navigation
    Wang, Sen
    Zhou, Dongliang
    Xie, Liang
    Xu, Chao
    Yan, Ye
    Yin, Erwei
    NEURAL NETWORKS, 2025, 187
  • [20] Structure-Encoding Auxiliary Tasks for Improved Visual Representation in Vision-and-Language Navigation
    Kuo, Chia-Wen
    Ma, Chih-Yao
    Hoffman, Judy
    Kira, Zsolt
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023, : 1104 - 1113