Multimodal Graph Reasoning and Fusion for Video Question Answering

Cited by: 1

Authors:

Zhang, Shuai [1]

Wang, Xingfu [1]

Hawbani, Ammar [1]

Zhao, Liang [2]

Alsamhi, Saeed Hamood [3,4]

Affiliations:

[1] Univ Sci & Technol China, Sch Comp Sci & Technol, Hefei, Peoples R China

[2] Shenyang Aerosp Univ, Sch Comp Sci, Shenyang, Peoples R China

[3] Natl Univ Ireland, Insight Ctr Data Analyt, Galway, Ireland

[4] IBB Univ, Ibb, Yemen

Source:

2022 IEEE INTERNATIONAL CONFERENCE ON TRUST, SECURITY AND PRIVACY IN COMPUTING AND COMMUNICATIONS, TRUSTCOM | 2022

Keywords:

Video Question Answering; Multimodal Reasoning; Graph Neural Network; Graph Fusion;

DOI:

10.1109/TrustCom56396.2022.00199

Chinese Library Classification:

TP [Automation Technology, Computer Technology];

Discipline classification code:

0812 ;

Abstract:

Video Question Answering (VideoQA) is a challenging multimodal task that requires recognizing visual elements and reasoning about their relations in the spatial and temporal dimensions, given a video and a question. Most existing GNN-based methods model the visual elements in a video as graph structures and reason about the relations between them. Despite their remarkable results, these methods neglect that the question also has graph-structured dependencies, which can be used to reason about relations between the video and the question. In this work, we propose a multimodal graph reasoning and fusion network that builds three graph neural networks for the appearance, motion, and text sequences, respectively, and hierarchically reasons over and fuses nodes from different modalities. The proposed method outperforms several state-of-the-art methods on three benchmark datasets.
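The abstract gives only a high-level description, so the following is a minimal sketch of the general idea and not the authors' implementation. It assumes that each modality (appearance, motion, text) is already encoded as a list of equal-length node feature vectors, that each graph is fully connected with dot-product attention as edge weights, and that fusion proceeds in two steps (appearance with motion, then text with the combined visual nodes). The function names `attend`, `graph_reason`, and `fuse` are hypothetical, and the sketch has no learned parameters.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(queries, keys):
    # For each query node, return the attention-weighted average of the key nodes.
    dim = len(keys[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, k) for k in keys])
        out.append([sum(w * k[d] for w, k in zip(weights, keys)) for d in range(dim)])
    return out

def add_residual(nodes, messages):
    # Element-wise sum of each node with the message it received.
    return [[n + m for n, m in zip(node, msg)] for node, msg in zip(nodes, messages)]

def graph_reason(nodes):
    # Assumption: one round of message passing on a fully connected graph.
    return add_residual(nodes, attend(nodes, nodes))

def mean_pool(nodes):
    return [sum(col) / len(nodes) for col in zip(*nodes)]

def fuse(appearance, motion, text):
    # Step 1: reason within each modality's own graph.
    a, m, t = graph_reason(appearance), graph_reason(motion), graph_reason(text)
    # Step 2 (assumed order): appearance nodes gather context from motion nodes.
    visual = add_residual(a, attend(a, m))
    # Step 3 (assumed order): text nodes gather context from the visual nodes.
    joint = add_residual(t, attend(t, visual))
    # Pool the fused nodes into one vector for a downstream answer classifier.
    return mean_pool(joint)

# Toy inputs: 2 appearance nodes, 1 motion node, 3 text nodes, feature size 2.
appearance = [[1.0, 0.0], [0.0, 1.0]]
motion = [[0.5, 0.5]]
text = [[1.0, 1.0], [0.0, 0.0], [1.0, 0.0]]
fused = fuse(appearance, motion, text)
print(fused)
```

In a trained model, the attention scores and node updates would pass through learned projections, and the number of reasoning rounds and the fusion order would follow the paper's design, which the abstract does not specify.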


Pages: 1410-1415

Page count: 6
