Multi-interaction Network with Object Relation for Video Question Answering

Cited by: 50
Authors
Jin, Weike [1]
Zhao, Zhou [1]
Gu, Mao [1]
Yu, Jun [2]
Xiao, Jun [1]
Zhuang, Yueting [1]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Hangzhou Dianzi Univ, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Zhejiang Province;
Keywords
video question answering; multi-interaction; object relation;
DOI
10.1145/3343031.3351065
Chinese Library Classification number
TP39 [Computer applications];
Discipline classification code
081203; 0835;
Abstract
Video question answering is an important task for testing a machine's ability to understand video. Existing methods normally focus on combining recurrent and convolutional neural networks to capture the spatial and temporal information of the video. Recently, some work has also shown that using an attention mechanism can achieve better performance. In this paper, we propose a new model called Multi-interaction network for video question answering. There are two types of interactions in our model. The first type is the multi-modal interaction between the visual and textual information. The second type is the multi-level interaction inside the multi-modal interaction. Specifically, instead of using the original self-attention, we propose a new attention mechanism called multi-interaction, which can capture both element-wise and segment-wise sequence interactions simultaneously. In addition to the normal frame-level interaction, we also take object relations into consideration in order to obtain more fine-grained information, such as motions and other potential relations among these objects. We evaluate our method on TGIF-QA and two other video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves new state-of-the-art performance.
Pages: 1193-1201
Number of pages: 9
Related papers
50 in total
  • [1] Graph-Based Multi-Interaction Network for Video Question Answering
    Gu, Mao
    Zhao, Zhou
    Jin, Weike
    Hong, Richang
    Wu, Fei
    IEEE TRANSACTIONS ON IMAGE PROCESSING, 2021, 30 : 2758 - 2770
  • [2] Multi-Granularity Interaction and Integration Network for Video Question Answering
    Wang, Yuanyuan
    Liu, Meng
    Wu, Jianlong
    Nie, Liqiang
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (12) : 7684 - 7695
  • [3] Pairwise VLAD Interaction Network for Video Question Answering
    Wang, Hui
    Guo, Dan
    Hua, Xian-Sheng
    Wang, Meng
    PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2021, 2021 : 5119 - 5127
  • [4] Action-Centric Relation Transformer Network for Video Question Answering
    Zhang, Jipeng
    Shao, Jie
    Cao, Rui
    Gao, Lianli
    Xu, Xing
    Shen, Heng Tao
    IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 63 - 74
  • [5] Multi-Attention Relation Network for Figure Question Answering
    Li, Ying
    Wu, Qingfeng
    Chen, Bin
    KNOWLEDGE SCIENCE, ENGINEERING AND MANAGEMENT, PT II, 2022, 13369 : 667 - 680
  • [6] Multi-Scale Progressive Attention Network for Video Question Answering
    Guo, Zhicheng
    Zhao, Jiaxuan
    Jiao, Licheng
    Liu, Xu
    Li, Lingling
    ACL-IJCNLP 2021: THE 59TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 11TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING, VOL 2, 2021 : 973 - 978
  • [7] Multi-Scale Progressive Attention Network for Video Question Answering
    Guo, Zhicheng
    Zhao, Jiaxuan
    Jiao, Licheng
    Liu, Xu
    Li, Lingling
    ACL-IJCNLP 2021 - 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Proceedings of the Conference, 2021, 2 : 873 - 878
  • [8] Advancing Video Question Answering with a Multi-modal and Multi-layer Question Enhancement Network
    Liu, Meng
    Zhang, Fenglei
    Luo, Xin
    Liu, Fan
    Wei, Yinwei
    Nie, Liqiang
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023 : 3985 - 3993
  • [9] Text-Guided Object Detector for Multi-modal Video Question Answering
    Shen, Ruoyue
    Inoue, Nakamasa
    Shinoda, Koichi
    2023 IEEE/CVF WINTER CONFERENCE ON APPLICATIONS OF COMPUTER VISION (WACV), 2023 : 1032 - 1042
  • [10] Hierarchical synchronization with structured multi-granularity interaction for video question answering
    Qi, Shanshan
    Yang, Luxi
    Li, Chunguo
    NEUROCOMPUTING, 2024, 582