Multi-task Learning of Hierarchical Vision-Language Representation

Cited by: 25
Authors
Nguyen, Duy-Kien [1]
Okatani, Takayuki [1, 2]
Affiliations
[1] Tohoku Univ, Grad Sch Informat Sci, Sendai, Miyagi, Japan
[2] RIKEN Ctr AIP, Tokyo, Japan
DOI
10.1109/CVPR.2019.01074
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Building an AI system that performs vision-and-language tasks at human level remains challenging. So far, researchers have addressed individual tasks separately, designing a network for each and training it on that task's dedicated datasets. Although this approach has seen a certain degree of success, it makes it difficult to understand the relations among different tasks and to transfer the knowledge learned for one task to others. We propose a multi-task learning approach that learns a vision-language representation shared by many tasks from their diverse datasets. The representation is hierarchical, and the prediction for each task is computed from the representation at its corresponding level of the hierarchy. We show through experiments that our method consistently outperforms previous single-task-learning methods on image caption retrieval, visual question answering, and visual grounding. We also analyze the learned hierarchical representation by visualizing attention maps generated in our network.
Pages: 10484-10493
Number of pages: 10