Unifying Vision-and-Language Tasks via Text Generation

Cited by: 0
Authors
Cho, Jaemin [1]
Lei, Jie [1]
Tan, Hao [1]
Bansal, Mohit [1]
Affiliations
[1] University of North Carolina, Chapel Hill, NC 27599, USA
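Keywords
DOI
Not available
Chinese Library Classification number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract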
Existing methods for vision-and-language learning typically require designing task-specific architectures and objectives for each task: for example, a multi-label answer classifier for visual question answering, a region scorer for referring expression comprehension, and a language decoder for image captioning. To alleviate these hassles, we propose a unified framework that learns different tasks in a single architecture with the same language modeling objective, i.e., multimodal conditional text generation, in which our models learn to generate labels as text based on the visual and textual inputs. On 7 popular vision-and-language benchmarks, including visual question answering, referring expression comprehension, and visual commonsense reasoning, most of which have previously been modeled as discriminative tasks, our generative approach (with a single unified architecture) reaches performance comparable to recent task-specific state-of-the-art vision-and-language models. Moreover, our generative approach shows better generalization on questions that have rare answers. We also show that our framework allows multi-task learning in a single architecture with a single set of parameters, achieving performance similar to separately optimized single-task models. Our code is publicly available at: https://github.com/j-min/VL-T5
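As an illustration of the idea in the abstract, the following minimal sketch shows how labels from different tasks could all be expressed as target text, so that one generation objective covers them. It is not the authors' implementation: the task prefixes, the `<region_3>` token, the sample inputs, and the names `TextToTextExample` and `as_text_generation` are hypothetical, and the visual features that would accompany each input are omitted.

```python
from dataclasses import dataclass


@dataclass
class TextToTextExample:
    source: str  # text given to the model (visual features omitted in this sketch)
    target: str  # text the model would be trained to generate


# Hypothetical task prefixes, chosen for illustration only.
PREFIXES = {
    "vqa": "vqa:",
    "refexp": "visual grounding:",
    "caption": "caption:",
}


def as_text_generation(task: str, text_input: str, label: str) -> TextToTextExample:
    """Cast a task-specific (input, label) pair into a (source text, target text) pair."""
    prefix = PREFIXES[task]
    source = f"{prefix} {text_input}".strip()
    return TextToTextExample(source=source, target=label)


examples = [
    # Answer classification becomes generating the answer string.
    as_text_generation("vqa", "what is the man holding?", "umbrella"),
    # Region scoring becomes generating a (hypothetical) region token.
    as_text_generation("refexp", "the yellow cup on the left", "<region_3>"),
    # Captioning has no text input here, only the prefix.
    as_text_generation("caption", "", "a man walking a dog"),
]

for ex in examples:
    print(f"{ex.source!r} -> {ex.target!r}")
```

Because every task reduces to the same pair of strings, a single sequence-to-sequence model and a single loss could in principle be shared across tasks, which is the property the abstract attributes to the framework.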
Pages: 12