Multi-Modal fusion with multi-level attention for Visual Dialog

Cited by: 10
Authors
Zhang, Jingping [1 ]
Wang, Qiang [2 ]
Han, Yahong [2 ]
Affiliations
[1] Shanghai Theatre Acad, Digital Media Art Dept, Shanghai, Peoples R China
[2] Tianjin Univ, Coll Intelligence & Comp, Tianjin, Peoples R China
Keywords
Visual Dialog; Multi-Modal; Multi-Level; Attention mechanism;
DOI
10.1016/j.ipm.2019.102152
CLC Classification
TP [Automation Technology, Computer Technology];
Subject Classification
0812;
Abstract
Given an input image, Visual Dialog requires answering a sequence of questions posed in the form of a dialog. To generate accurate answers, a model must consider all of the available information: the dialog history, the current question, and the image. However, existing methods usually use only the high-level semantic representation of each whole sentence in the dialog history and the question, ignoring the low-level, detailed information carried by the individual words; similarly, the low-level region details of the image also need to be considered for question answering. We therefore propose a novel visual dialog method that attends to both the high-level and the low-level information of the dialog history, the question, and the image. Our approach introduces three low-level attention modules whose goal is to enhance the word representations of the dialog history and the question based on word-to-word connections, and to enrich the region information of the image based on region-to-region relations. In addition, we design three high-level attention modules that select the important words in the dialog history and the question, complementing the detailed information for semantic understanding, and select the relevant regions of the image, providing targeted visual information for question answering. We evaluate the proposed approach on two datasets, VisDial v0.9 and VisDial v1.0; the experimental results demonstrate that jointly exploiting low-level and high-level information enhances the representation of the inputs.
Pages: 11
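To make the abstract's low-level/high-level distinction concrete, below is a minimal PyTorch sketch of the two kinds of attention it describes: a low-level self-attention that enriches each element of a modality through word-to-word (or region-to-region) relations, and a high-level guided attention that selects the elements most relevant to a guiding vector such as the question. All module names, dimensions, and scoring functions here are illustrative assumptions; the abstract does not disclose the paper's actual architecture.

```python
# Minimal sketch of the two attention levels described in the abstract.
# All names, dimensions, and scoring functions are illustrative
# assumptions, not the paper's actual implementation.
import torch
import torch.nn as nn


class LowLevelSelfAttention(nn.Module):
    """Self-attention within one modality: word-to-word for sentences,
    region-to-region for the image."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_elements, dim)
        q, k, v = self.query(x), self.key(x), self.value(x)
        scores = q @ k.transpose(1, 2) / x.size(-1) ** 0.5
        attn = torch.softmax(scores, dim=-1)  # element-to-element relations
        return x + attn @ v                   # residual enrichment


class HighLevelGuidedAttention(nn.Module):
    """Attention that pools one modality into a single vector, guided by
    a context vector (e.g., question-guided selection of image regions)."""

    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(2 * dim, 1)

    def forward(self, x: torch.Tensor, guide: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_elements, dim); guide: (batch, dim)
        g = guide.unsqueeze(1).expand(-1, x.size(1), -1)
        weights = torch.softmax(
            self.score(torch.cat([x, g], dim=-1)).squeeze(-1), dim=-1
        )
        return (weights.unsqueeze(-1) * x).sum(dim=1)  # (batch, dim)


if __name__ == "__main__":
    batch, n_words, n_regions, dim = 2, 12, 36, 512
    words = torch.randn(batch, n_words, dim)      # question word embeddings
    regions = torch.randn(batch, n_regions, dim)  # image region features
    q_vec = torch.randn(batch, dim)               # sentence-level question vector

    words = LowLevelSelfAttention(dim)(words)      # word-to-word enrichment
    regions = LowLevelSelfAttention(dim)(regions)  # region-to-region enrichment
    visual = HighLevelGuidedAttention(dim)(regions, q_vec)  # question-guided regions
    print(visual.shape)  # torch.Size([2, 512])
```

In the paper's setting, three modules of each kind would presumably be instantiated, one each for the dialog history, the question, and the image, with the resulting features fused to produce the answer.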
Related Papers
50 items in total
  • [1] MBIAN: Multi-level bilateral interactive attention network for multi-modal
    Sun, Kai
    Zhang, Jiangshe
    Wang, Jialin
    Xu, Shuang
    Zhang, Chunxia
    Hu, Junying
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 231
  • [2] Multi-level Fusion of Multi-modal Semantic Embeddings for Zero Shot Learning
    Kong, Zhe
    Wang, Xin
    Gao, Neng
    Zhang, Yifei
    Liu, Yuhan
    Tu, Chenyang
    [J]. PROCEEDINGS OF THE 2022 INTERNATIONAL CONFERENCE ON MULTIMODAL INTERACTION, ICMI 2022, 2022: 310 - 318
  • [3] Multi-Level Multi-Modal Cross-Attention Network for Fake News Detection
    Ying, Long
    Yu, Hui
    Wang, Jinguang
    Ji, Yongze
    Qian, Shengsheng
    [J]. IEEE ACCESS, 2021, 9 : 132363 - 132373
  • [4] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    [J]. PEERJ COMPUTER SCIENCE, 2023, 9
  • [5] SiamMMF: multi-modal multi-level fusion object tracking based on Siamese networks
    Yang, Zhen
    Huang, Peng
    He, Dunyun
    Cai, Zhongwang
    Yin, Zhijian
    [J]. MACHINE VISION AND APPLICATIONS, 2023, 34 (01)
  • [6] MLSFF: Multi-level structural features fusion for multi-modal knowledge graph completion
    Zhai, Hanming
    Lv, Xiaojun
    Hou, Zhiwen
    Tong, Xin
    Bu, Fanliang
    [J]. MATHEMATICAL BIOSCIENCES AND ENGINEERING, 2023, 20 (08) : 14096 - 14116
  • [7] MLMFNet: A multi-level modality fusion network for multi-modal accelerated MRI reconstruction
    Zhou, Xiuyun
    Zhang, Zhenxi
    Du, Hongwei
    Qiu, Bensheng
    [J]. MAGNETIC RESONANCE IMAGING, 2024, 111 : 246 - 255
  • [8] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
    [J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2022, 25 (04) : 1607 - 1623