Multi-level, multi-modal interactions for visual question answering over text in images

Cited by: 0
Authors
Jincai Chen
Sheng Zhang
Jiangfeng Zeng
Fuhao Zou
Yuan-Fang Li
Tao Liu
Ping Lu
Affiliations
[1] Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics
[2] Huazhong University of Science and Technology, Key Laboratory of Information Storage System, School of Computer Science and Technology
[3] Huazhong University of Science and Technology, School of Computer Science and Technology
[4] Central China Normal University, School of Information Management
[5] Monash University, Department of Data Science and AI, Faculty of Information Technology
Source
World Wide Web | 2022, Vol. 25
Keywords
Multi-modal feature interaction; Visual question answering; Self-attention mechanism; Optical character recognition; Multi-level feature fusion;
DOI
Not available
Abstract
The TextVQA task requires a simultaneous understanding of images, questions, and the text embedded in images in order to reason about answers. However, most existing cross-modal tasks involve only two modalities, so there are few methods for modeling interactions across three. To bridge this gap, in this work we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of the proposed interaction modules, and we design a multi-level feature joint prediction approach that exploits the output representations of each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model improves accuracy by 5.42% over the baseline. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
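To make the core mechanism concrete, below is a minimal PyTorch sketch of an interaction module in the spirit of the abstract: scaled dot-product attention that can serve both cross-modal interaction (query and context from different modalities) and intra-modal interaction (query attending to itself), with an optional guidance term biasing the attention scores. This is an illustrative sketch written for this summary, not the authors' released implementation (see the repository linked above); the class name ScaledDotProductInteraction, the additive form of the guidance argument, and the single-head formulation are all assumptions.

import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScaledDotProductInteraction(nn.Module):
    """Attention from a query modality to a key/value modality.

    With context == query, this models intra-modal interaction; with
    features from another modality (e.g., question tokens attending to
    OCR-token features), it models cross-modal interaction.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.scale = math.sqrt(dim)

    def forward(self, query, context, guidance=None):
        # query:    (B, Nq, D) features of the attending modality
        # context:  (B, Nk, D) features of the attended modality
        # guidance: optional (B, Nq, Nk) bias steering the attention
        #           distribution; an additive bias is an assumption here,
        #           the abstract does not specify the exact form.
        q = self.q_proj(query)
        k = self.k_proj(context)
        v = self.v_proj(context)
        scores = q @ k.transpose(-2, -1) / self.scale  # (B, Nq, Nk)
        if guidance is not None:
            scores = scores + guidance
        attn = F.softmax(scores, dim=-1)
        return attn @ v  # (B, Nq, D) updated query-modality features

# Usage with hypothetical sizes: question features attending to OCR-token
# features (cross-modal) and to themselves (intra-modal).
block = ScaledDotProductInteraction(dim=768)
question = torch.randn(2, 20, 768)    # (batch, num_words, dim)
ocr_tokens = torch.randn(2, 50, 768)  # (batch, num_OCR_tokens, dim)
cross = block(question, ocr_tokens)   # question enriched by OCR context
intra = block(question, question)     # question self-interaction

In the full three-modality setting, each of the question, visual-object, and OCR-token feature sets would attend both to itself and to the other two modalities, and stacking several such blocks yields the per-level outputs that the multi-level joint prediction step combines.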
Pages: 1607-1623
Number of pages: 16
Related papers
50 items in total
  • [1] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
    WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2022, 25(4): 1607-1623
  • [2] Multi-modal Contextual Graph Neural Network for Text Visual Question Answering
    Liang, Yaoyuan
    Wang, Xin
    Duan, Xuguang
    Zhu, Wenwu
    2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021: 3491-3498
  • [3] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32(9): 3894-3908
  • [4] Multi-modal adaptive gated mechanism for visual question answering
    Xu, Yangshuyi
    Zhang, Lin
    Shen, Xiang
    PLOS ONE, 2023, 18(6)
  • [5] Multi-scale relation reasoning for multi-modal Visual Question Answering
    Wu, Yirui
    Ma, Yuntao
    Wan, Shaohua
    SIGNAL PROCESSING-IMAGE COMMUNICATION, 2021, 96
  • [6] Multi-level Attention Networks for Visual Question Answering
    Yu, Dongfei
    Fu, Jianlong
    Mei, Tao
    Rui, Yong
    30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 4187-4195
  • [7] Multi-Modal fusion with multi-level attention for Visual Dialog
    Zhang, Jingping
    Wang, Qiang
    Han, Yahong
    INFORMATION PROCESSING & MANAGEMENT, 2020, 57(4)
  • [8] Multi-modal spatial relational attention networks for visual question answering
    Yao, Haibo
    Wang, Lipeng
    Cai, Chengtao
    Sun, Yuxin
    Zhang, Zhi
    Luo, Yongkang
    IMAGE AND VISION COMPUTING, 2023, 140
  • [9] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
    Siebert, Tim
    Clasen, Kai Norman
    Ravanbakhsh, Mahdyar
    Demir, Beguem
    IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
  • [10] The multi-modal fusion in visual question answering: a review of attention mechanisms
    Lu, Siyu
    Liu, Mingzhe
    Yin, Lirong
    Yin, Zhengtong
    Liu, Xuan
    Zheng, Wenfeng
    PEERJ COMPUTER SCIENCE, 2023, 9