Multi-level, multi-modal interactions for visual question answering over text in images

Cited by: 0

Authors:

Jincai Chen

Sheng Zhang

Jiangfeng Zeng

Fuhao Zou

Yuan-Fang Li

Tao Liu

Ping Lu

Affiliations:

[1] Huazhong University of Science and Technology, Wuhan National Laboratory for Optoelectronics

[2] Huazhong University of Science and Technology, Key Laboratory of Information Storage System, School of Computer Science and Technology

[3] Huazhong University of Science and Technology, School of Computer Science and Technology

[4] Central China Normal University, School of Information Management

[5] Monash University, Department of Data Science and AI, Faculty of Information Technology

Source:

World Wide Web | 2022 / Volume 25

Keywords:

Multi-modal feature interaction; Visual question answering; Self-attention mechanism; Optical character recognition; Multi-level feature fusion

DOI:

Not available

Abstract:

The TextVQA task involves visual scenes that contain text, so reasoning about answers requires a simultaneous understanding of the image, the question, and the text in the image. However, most existing cross-modal tasks involve only two modalities, so there are few methods for modeling interactions across three modalities. To bridge this gap, we propose cross- and intra-modal interaction modules for multiple (more than two) modalities, in which scaled dot-product attention is applied to model inter- and intra-modal relationships. In addition, we introduce guidance information to help the attention mechanism learn a more accurate relationship distribution. We construct a Multi-level Complete Interaction (MLCI) model for the TextVQA task by stacking multiple blocks composed of our proposed interaction modules. We design a multi-level feature joint prediction approach that exploits the output representations from each block in a complementary way to predict answers. Experimental results on the TextVQA dataset show that our model obtains a 5.42% improvement in accuracy over the baseline. Extensive ablation studies are carried out for a comprehensive analysis of the proposed method. Our code is publicly available at https://github.com/zhangshengHust/mlci.
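
As a rough illustration of the scaled dot-product attention that the abstract names, the sketch below applies softmax(QK^T / sqrt(d))V once within a modality (intra-modal) and once across modalities (cross-modal). This is not the authors' MLCI implementation, which is available at the repository linked above. The function names, the toy feature values, and the choice of the question as the query modality are assumptions made for illustration, and the guidance information and multi-level joint prediction described in the abstract are not modeled here.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """Return (outputs, weights) for softmax(Q K^T / sqrt(d)) V.

    queries: list of n vectors of size d
    keys:    list of m vectors of size d
    values:  list of m vectors of any common size
    """
    d = len(keys[0])
    weights = []
    for q in queries:
        scores = [
            sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
            for k in keys
        ]
        weights.append(softmax(scores))
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values))
         for j in range(len(values[0]))]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy features for the three modalities (values are made up).
question = [[1.0, 0.0], [0.0, 1.0]]            # 2 question tokens
image = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]  # 3 image regions
ocr = [[2.0, 0.0], [0.0, 0.5]]                 # 2 OCR tokens

# Intra-modal interaction: the question attends to itself.
q_self, w_self = scaled_dot_product_attention(question, question, question)

# Cross-modal interaction: the question attends to image regions and OCR tokens.
q_img, w_img = scaled_dot_product_attention(question, image, image)
q_ocr, w_ocr = scaled_dot_product_attention(question, ocr, ocr)
```

Each row of the returned weights is a probability distribution over the attended modality's elements, which is the "relationship distribution" an attention layer learns in this kind of setup.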

Pages: 1607-1623

Page count: 16

50 items in total

[1] Multi-level, multi-modal interactions for visual question answering over text in images
Chen, Jincai
Zhang, Sheng
Zeng, Jiangfeng
Zou, Fuhao
Li, Yuan-Fang
Liu, Tao
Lu, Ping
[J]. WORLD WIDE WEB-INTERNET AND WEB INFORMATION SYSTEMS, 2022, 25 (04): 1607-1623
[2] Multi-modal Contextual Graph Neural Network for Text Visual Question Answering
Liang, Yaoyuan
Wang, Xin
Duan, Xuguang
Zhu, Wenwu
[J]. 2020 25TH INTERNATIONAL CONFERENCE ON PATTERN RECOGNITION (ICPR), 2021: 3491-3498
[3] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
Liu, Yun
Zhang, Xiaoming
Huang, Feiran
Cheng, Lei
Li, Zhoujun
[J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09): 3894-3908
[4] Multi-modal adaptive gated mechanism for visual question answering
Xu, Yangshuyi
Zhang, Lin
Shen, Xiang
[J]. PLOS ONE, 2023, 18 (06)
[5] Multi-scale relation reasoning for multi-modal Visual Question Answering
Wu, Yirui
Ma, Yuntao
Wan, Shaohua
[J]. SIGNAL PROCESSING-IMAGE COMMUNICATION, 2021, 96
[6] Multi-level Attention Networks for Visual Question Answering
Yu, Dongfei
Fu, Jianlong
Mei, Tao
Rui, Yong
[J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017: 4187-4195
[7] Multi-Modal fusion with multi-level attention for Visual Dialog
Zhang, Jingping
Wang, Qiang
Han, Yahong
[J]. INFORMATION PROCESSING & MANAGEMENT, 2020, 57 (04)
[8] Multi-modal spatial relational attention networks for visual question answering
Yao, Haibo
Wang, Lipeng
Cai, Chengtao
Sun, Yuxin
Zhang, Zhi
Luo, Yongkang
[J]. IMAGE AND VISION COMPUTING, 2023, 140
[9] Multi-Modal Fusion Transformer for Visual Question Answering in Remote Sensing
Siebert, Tim
Clasen, Kai Norman
Ravanbakhsh, Mahdyar
Demir, Beguem
[J]. IMAGE AND SIGNAL PROCESSING FOR REMOTE SENSING XXVIII, 2022, 12267
[10] The multi-modal fusion in visual question answering: a review of attention mechanisms
Lu, Siyu
Liu, Mingzhe
Yin, Lirong
Yin, Zhengtong
Liu, Xuan
Zheng, Wenfeng
[J]. PEERJ COMPUTER SCIENCE, 2023, 9
