Dual Path Multi-Modal High-Order Features for Textual Content based Visual Question Answering

Cited by: 0
Authors
Li, Yanan [1]
Lin, Yuetan [2]
Zhao, Honghui [3]
Wang, Donghui [3]
Affiliations
[1] Zhejiang Lab, Artificial Intelligence Inst, Hangzhou, Peoples R China
[2] Tencent YouTu Lab, Shanghai, Peoples R China
[3] Zhejiang Univ, Artificial Intelligence Inst, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China
DOI
10.1109/ICPR48806.2021.9412231
Chinese Library Classification
TP18 [Artificial intelligence theory]
Subject classification codes
081104; 0812; 0835; 1405
Abstract
As a typical cross-modal problem, visual question answering (VQA) has received increasing attention from the computer vision and natural language processing communities. Reading and reasoning about text and visual content in images is a fast-growing and important research topic in VQA, especially for applications that assist visually impaired users. Given an image, the task is to predict the answer to a natural language question that is closely related to the text appearing in the image. In this paper, we propose a novel end-to-end textual-content-based VQA model that grounds question answering in both visual and textual information. After encoding the image, the question, and the recognized text words, the model uses multi-modal factorized high-order modules and an attention mechanism to fuse question-image and question-text features, respectively, so that the complex correlations among different features can be captured efficiently. To keep the model extensible, it embeds candidate answers and recognized texts in a shared semantic embedding space and adopts a semantic-embedding-based classifier to perform answer prediction. Extensive experiments on the recently proposed TextVQA benchmark demonstrate that the proposed model achieves promising results.
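The dual-path design described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (`mfh_fuse`, `attend`, `answer_scores`), the dimensions, the random weights, and the choice to concatenate the two paths are all assumptions made for illustration. The fusion step follows the general idea of factorized high-order pooling (project both inputs, multiply element-wise, sum-pool, and cascade several such blocks), and the classifier scores each candidate answer by a dot product in the embedding space, so recognized text words can be appended as extra candidates without changing the model.

```python
import numpy as np

def mfh_fuse(x, y, blocks, k):
    """Cascade factorized bilinear blocks and concatenate their pooled outputs."""
    expanded = 1.0
    outputs = []
    for U, V in blocks:
        # Each block multiplies its projected features into the running product,
        # which raises the interaction order by one per block.
        expanded = expanded * ((x @ U) * (y @ V))
        pooled = expanded.reshape(-1, k).sum(axis=1)        # sum-pool groups of k
        pooled = np.sign(pooled) * np.sqrt(np.abs(pooled))  # power normalization
        outputs.append(pooled / (np.linalg.norm(pooled) + 1e-12))
    return np.concatenate(outputs)

def attend(query, items, blocks, k, w):
    """Softmax attention over items (image regions or text words) given the query."""
    logits = np.array([mfh_fuse(query, item, blocks, k) @ w for item in items])
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ items, weights

def make_blocks(rng, d_x, d_y, o, k, order):
    """Hypothetical helper: random projection pairs standing in for learned weights."""
    return [(rng.normal(size=(d_x, o * k)), rng.normal(size=(d_y, o * k)))
            for _ in range(order)]

def answer_scores(question, regions, ocr_tokens, vocab_answers, params, k):
    # Path 1: question-image attention and fusion.
    img_att, _ = attend(question, regions, params["att_img"], k, params["w_img"])
    # Path 2: question-text attention and fusion.
    txt_att, _ = attend(question, ocr_tokens, params["att_txt"], k, params["w_txt"])
    fused = np.concatenate([
        mfh_fuse(question, img_att, params["fuse_img"], k),
        mfh_fuse(question, txt_att, params["fuse_txt"], k),
    ])
    # Candidates are the fixed answer vocabulary plus the recognized text words,
    # all living in the same embedding space.
    candidates = np.vstack([vocab_answers, ocr_tokens])
    return candidates @ (fused @ params["W_out"])

rng = np.random.default_rng(0)
d_q, d_v, d_e, o, k, order = 8, 6, 5, 4, 3, 2   # toy sizes, chosen arbitrarily
question = rng.normal(size=d_q)
regions = rng.normal(size=(7, d_v))             # 7 image-region features
ocr_tokens = rng.normal(size=(3, d_e))          # 3 recognized-word embeddings
vocab_answers = rng.normal(size=(10, d_e))      # 10 fixed-vocabulary answers
params = {
    "att_img": make_blocks(rng, d_q, d_v, o, k, order),
    "att_txt": make_blocks(rng, d_q, d_e, o, k, order),
    "fuse_img": make_blocks(rng, d_q, d_v, o, k, order),
    "fuse_txt": make_blocks(rng, d_q, d_e, o, k, order),
    "w_img": rng.normal(size=order * o),
    "w_txt": rng.normal(size=order * o),
    "W_out": rng.normal(size=(2 * order * o, d_e)),
}
scores = answer_scores(question, regions, ocr_tokens, vocab_answers, params, k)
print(scores.shape)  # (13,)
```

The last three scores belong to the recognized text words, so a word that never appeared in the training vocabulary can still be selected as the answer. That is one plausible reading of the extensibility claim; the paper itself should be consulted for the actual architecture and training objective.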
Pages: 4324-4331
Page count: 8