Multi-modal Contextual Graph Neural Network for Text Visual Question Answering

Cited by: 4
|
Authors
Liang, Yaoyuan [1 ]
Wang, Xin [2 ,3 ]
Duan, Xuguang [2 ]
Zhu, Wenwu [2 ,3 ]
Affiliations
[1] Beijing Univ Posts & Telecommun, Beijing, Peoples R China
[2] Tsinghua Univ, Dept Comp Sci & Technol, Beijing, Peoples R China
[3] Peng Cheng Lab, Shenzhen, Peoples R China
Funding
National Natural Science Foundation of China;
DOI
10.1109/ICPR48806.2021.9412891
CLC number
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Text visual question answering (TextVQA) aims to answer questions about the texts appearing in given images, posing more challenges than standard VQA by requiring deeper recognition and understanding of variously shaped, human-readable scene texts as well as their meanings in different contexts. Existing works on TextVQA suffer from two weaknesses: i) scene texts and non-textual objects are processed separately and independently, without considering their mutual interactions during question understanding and answering; ii) scene texts are encoded only through word embeddings, without taking into account their visual appearance features or their potential relationships with other non-textual objects in the images. To overcome these weaknesses, we propose a novel multi-modal contextual graph neural network (MCG) model for TextVQA. The proposed MCG model captures the relationships between the visual features of scene texts and non-textual objects in the given images, and exploits richer sources of multi-modal features to improve performance. In particular, we encode the scene texts into richer features containing textual, visual, and positional information, then model the visual relations between scene texts and non-textual objects through a contextual graph neural network. Extensive experiments on a real-world dataset demonstrate the advantages of the proposed MCG model over baseline approaches.
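The abstract describes the two core ideas: each node carries a concatenation of textual, visual, and positional features, and a graph layer propagates context between scene-text nodes and object nodes. The toy sketch below (an assumption-laden illustration, not the authors' implementation; all sizes, the fully connected text-object graph, and the single mean-aggregation layer are hypothetical choices) shows the general shape of such a model:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy scene: 3 scene-text tokens and 2 detected objects.
# Each node's multi-modal feature concatenates a word embedding (zeros for
# objects, which have no text), a visual appearance feature, and a 4-d
# normalized bounding box -- mirroring the textual/visual/positional encoding.
n_text, n_obj, d_word, d_vis = 3, 2, 8, 8
word = np.vstack([rng.normal(size=(n_text, d_word)),
                  np.zeros((n_obj, d_word))])
visual = rng.normal(size=(n_text + n_obj, d_vis))
boxes = rng.uniform(size=(n_text + n_obj, 4))        # (x1, y1, x2, y2)

nodes = np.concatenate([word, visual, boxes], axis=1)  # shape (5, 20)

# Contextual graph: connect every text node to every object node, so one
# round of message passing mixes object context into text features and
# vice versa; self-loops keep each node's own features.
n = n_text + n_obj
adj = np.zeros((n, n))
adj[:n_text, n_text:] = 1.0
adj[n_text:, :n_text] = 1.0
adj += np.eye(n)
adj /= adj.sum(axis=1, keepdims=True)   # row-normalize -> mean aggregation

# One GNN layer: aggregate neighbors, apply a learned linear map + ReLU.
W = rng.normal(scale=0.1, size=(nodes.shape[1], 16))
hidden = np.maximum(adj @ nodes @ W, 0.0)  # (5, 16) contextual embeddings
```

The resulting `hidden` vectors would feed a question-conditioned answering head; a real model would stack several such layers with learned attention over edges rather than uniform averaging.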
Pages: 3491 - 3498
Number of pages: 8
Related Papers
50 records in total
  • [1] Visual question answering model based on graph neural network and contextual attention
    Sharma, Himanshu
    Jalal, Anand Singh
    [J]. IMAGE AND VISION COMPUTING, 2021, 110
  • [2] Multi-level, multi-modal interactions for visual question answering over text in images
    Chen, Jincai
    Zhang, Sheng
    Zeng, Jiangfeng
    Zou, Fuhao
    Li, Yuan-Fang
    Liu, Tao
    Lu, Ping
    [J]. World Wide Web, 2022, 25 (04) : 1607 - 1623
  • [3] Knowledge-Based Visual Question Answering Using Multi-Modal Semantic Graph
    Jiang, Lei
    Meng, Zuqiang
    [J]. ELECTRONICS, 2023, 12 (06)
  • [4] Adversarial Learning With Multi-Modal Attention for Visual Question Answering
    Liu, Yun
    Zhang, Xiaoming
    Huang, Feiran
    Cheng, Lei
    Li, Zhoujun
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2021, 32 (09) : 3894 - 3908
  • [5] Multi-modal adaptive gated mechanism for visual question answering
    Xu, Yangshuyi
    Zhang, Lin
    Shen, Xiang
    [J]. PLOS ONE, 2023, 18 (06)
  • [6] Interpretable medical image Visual Question Answering via multi-modal relationship graph learning
    Hu, Xinyue
    Gu, Lin
    Kobayashi, Kazuma
    Liu, Liangchen
    Zhang, Mengliang
    Harada, Tatsuya
    Summers, Ronald M.
    Zhu, Yingying
    [J]. MEDICAL IMAGE ANALYSIS, 2024, 97
  • [7] Multi-modal Question Answering System Driven by Domain Knowledge Graph
    Zhao, Zhengwei
    Wang, Xiaodong
    Xu, Xiaowei
    Wang, Qing
    [J]. 5TH INTERNATIONAL CONFERENCE ON BIG DATA COMPUTING AND COMMUNICATIONS (BIGCOM 2019), 2019, : 43 - 47
  • [8] Holistic Multi-Modal Memory Network for Movie Question Answering
    Wang, Anran
    Anh Tuan Luu
    Foo, Chuan-Sheng
    Zhu, Hongyuan
    Tay, Yi
    Chandrasekhar, Vijay
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2020, 29 : 489 - 499