Multimodal Graph Reasoning and Fusion for Video Question Answering

Cited by: 1
Authors
Zhang, Shuai [1]
Wang, Xingfu [1]
Hawbani, Ammar [1]
Zhao, Liang [2]
Alsamhi, Saeed Hamood [3, 4]
Affiliations
[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei, Peoples R China
[2] Shenyang Aerosp Univ, Sch Comp Sci, Shenyang, Peoples R China
[3] Natl Univ Ireland, Insight Ctr Data Analyt, Galway, Ireland
[4] IBB Univ, Ibb, Yemen
Keywords
Video Question Answering; Multimodal Reasoning; Graph Neural Network; Graph Fusion
DOI
10.1109/TrustCom56396.2022.00199
Chinese Library Classification (CLC) number
TP [Automation Technology, Computer Technology]
Discipline classification code
0812
Abstract
Video Question Answering (VideoQA) is a challenging multimodal task that requires recognizing visual elements and reasoning about their relations in the spatial and temporal dimensions, given a video and a question. Most existing GNN-based methods model the visual elements in a video as graph structures and reason about the relations between them. Despite their remarkable results, these methods neglect that the question also has graph-structured dependencies, which can be used to reason about relations between the video and the question. In this work, we propose a multimodal graph reasoning and fusion network that builds three graph neural networks, one each for the appearance, motion, and text sequences, and hierarchically reasons over and fuses nodes from different modalities. The proposed method outperforms several state-of-the-art methods on three benchmark datasets.
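The abstract does not give the layer definitions, so the following is only a toy sketch of the general idea: reason within each of three modality graphs, then fuse nodes across modalities in stages. Everything in it is an assumption made for illustration, not the authors' architecture: the fully connected graphs, the dot-product attention used as message passing, the fusion order (appearance with motion first, then text with the visual result), the helper names `attend`, `graph_reason`, `fuse`, and `mean_pool`, and the hand-written 3-dimensional node features.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys):
    """For each query node, return a softmax-weighted average of the key nodes."""
    out = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        out.append([sum(w * k[d] for w, k in zip(weights, keys))
                    for d in range(len(q))])
    return out

def graph_reason(nodes):
    """One round of message passing on a fully connected graph (assumed),
    with a residual update: node + aggregated neighbor message."""
    messages = attend(nodes, nodes)
    return [[n + m for n, m in zip(node, msg)]
            for node, msg in zip(nodes, messages)]

def fuse(target, source):
    """Cross-modal fusion (assumed form): each target node gathers a message
    from all source nodes and adds it to its own feature."""
    messages = attend(target, source)
    return [[t + m for t, m in zip(node, msg)]
            for node, msg in zip(target, messages)]

def mean_pool(nodes):
    """Average all node features into a single vector."""
    return [sum(col) / len(nodes) for col in zip(*nodes)]

# Hypothetical node features: one node per frame, clip, or word.
appearance = [[1.0, 0.0, 0.5], [0.8, 0.1, 0.4]]
motion = [[0.0, 1.0, 0.2], [0.1, 0.9, 0.3], [0.2, 0.7, 0.1]]
text = [[0.5, 0.5, 1.0], [0.3, 0.6, 0.9]]

# Step 1: reason inside each modality graph.
appearance, motion, text = (graph_reason(g) for g in (appearance, motion, text))

# Step 2: fuse hierarchically (assumed order).
visual = fuse(appearance, motion)
joint = fuse(text, visual)

# Step 3: pool the fused nodes into one feature for an answer classifier.
answer_feature = mean_pool(joint)
print(len(answer_feature))
```

A real model would use learned projections and trained GNN layers in place of the raw dot products here; the sketch only shows how per-modality reasoning and staged fusion can be composed.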
Pages: 1410-1415
Number of pages: 6