Unifying Vision-and-Language Tasks via Text Generation

Cited by: 0
Authors: Cho, Jaemin [1]; Lei, Jie [1]; Tan, Hao [1]; Bansal, Mohit [1]
Affiliation: [1] Univ N Carolina, Chapel Hill, NC 27599 USA
DOI: not available
Chinese Library Classification: TP18 [Artificial Intelligence Theory]
Subject classification codes: 081104; 0812; 0835; 1405
Abstract
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, where our models learn to generate labels in text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
Pages: 12
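
The abstract's central idea, casting every task as multimodal conditional text generation trained with one language-modeling loss, can be illustrated with a short sketch. Below is a minimal, hypothetical example using plain text-only T5 from HuggingFace transformers as a stand-in for the paper's VL-T5 (which additionally feeds image region features into the encoder); the task prefixes, the <vis_3> region token, and the input/target strings are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch: every task becomes "generate the label as text".
# Plain T5 stands in for VL-T5; a real reproduction would also pass
# visual region features to the encoder. Prefixes/targets are illustrative.
from transformers import T5Tokenizer, T5ForConditionalGeneration

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")

examples = [
    ("vqa: what is the man holding?", "umbrella"),      # answer as text, not a class id
    ("refer: the dog on the left", "<vis_3>"),          # region referred to via a text token
    ("caption:", "a man walking a dog in the park"),    # captioning is already generative
]

for source, target in examples:
    inputs = tokenizer(source, return_tensors="pt")
    labels = tokenizer(target, return_tensors="pt").input_ids
    # One shared objective for all tasks: cross-entropy over label tokens.
    loss = model(**inputs, labels=labels).loss
    print(f"{source!r} -> {target!r}: loss = {loss.item():.3f}")

# Inference is plain decoding, so the same head serves every task.
out = model.generate(**tokenizer("vqa: what is the man holding?",
                                 return_tensors="pt"), max_new_tokens=8)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

Because answers are decoded from the full vocabulary rather than scored by a fixed classifier head, rare and unseen answers remain expressible, which is the property the abstract credits for better generalization on rare-answer questions.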