Multi-interaction Network with Object Relation for Video Question Answering

Cited by: 50
Authors
Jin, Weike [1]
Zhao, Zhou [1]
Gu, Mao [1]
Yu, Jun [2]
Xiao, Jun [1]
Zhuang, Yueting [1]
Affiliations
[1] Zhejiang Univ, Hangzhou, Peoples R China
[2] Hangzhou Dianzi Univ, Hangzhou, Peoples R China
Funding
National Natural Science Foundation of China; Natural Science Foundation of Zhejiang Province
Keywords
video question answering; multi-interaction; object relation
DOI
10.1145/3343031.3351065
Chinese Library Classification number
TP39 [Computer applications]
Discipline classification codes
081203; 0835
Abstract
Video question answering is an important task for testing a machine's ability to understand video. Existing methods normally combine recurrent and convolutional neural networks to capture the spatial and temporal information of the video. Recently, some work has also shown that attention mechanisms can achieve better performance. In this paper, we propose a new model called Multi-interaction network for video question answering. There are two types of interaction in our model. The first is the multi-modal interaction between visual and textual information. The second is the multi-level interaction inside the multi-modal interaction. Specifically, instead of using the original self-attention, we propose a new attention mechanism called multi-interaction, which can capture element-wise and segment-wise sequence interactions simultaneously. In addition to the normal frame-level interaction, we also take object relations into consideration in order to obtain more fine-grained information, such as motions and other potential relations among these objects. We evaluate our method on TGIF-QA and two other video QA datasets. The qualitative and quantitative experimental results show the effectiveness of our model, which achieves new state-of-the-art performance.
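For readers unfamiliar with the multi-modal interaction the abstract refers to, the following is a minimal sketch of standard scaled dot-product attention in which question-word vectors attend over frame vectors. It is not the authors' multi-interaction mechanism, which this record does not specify. The function names, the toy vectors, and the two-dimensional feature size are all assumptions made for illustration.

```python
import math

def softmax(row):
    # Numerically stable softmax over a list of scores.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cross_modal_attention(question, frames):
    """Hypothetical helper: each question vector attends over all frame vectors.

    question: list of n_words vectors, frames: list of n_frames vectors,
    all of the same dimension d. Returns (attended, weights).
    """
    d = len(frames[0])
    attended, weights = [], []
    for q in question:
        # Similarity of this word to every frame, scaled by sqrt(d).
        scores = [dot(q, f) / math.sqrt(d) for f in frames]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of frame vectors for this word.
        attended.append(
            [sum(w_j * f[k] for w_j, f in zip(w, frames)) for k in range(d)]
        )
    return attended, weights

# Toy inputs (assumed values, not from any dataset).
question = [[1.0, 0.0], [0.0, 1.0]]
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attended, weights = cross_modal_attention(question, frames)
```

Each row of `weights` sums to 1 and shows how strongly one question word attends to each frame. According to the abstract, the proposed model replaces this kind of attention with one that also captures segment-wise interactions and object relations.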
Pages: 1193-1201
Page count: 9
Related papers
50 in total
  • [41] Local relation network with multilevel attention for visual question answering
    Sun, Bo
    Yao, Zeng
    Zhang, Yinghui
    Yu, Lejun
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2020, 73
  • [42] Affective question answering on video
    Ruwa, Nelson
    Mao, Qirong
    Wang, Liangjun
    Gou, Jianping
    NEUROCOMPUTING, 2019, 363 : 125 - 139
  • [43] A reasoning enhance network for muti-relation question answering
    Wenqing Wu
    Zhenfang Zhu
    Guangyuan Zhang
    Shiyong Kang
    Peiyu Liu
    Applied Intelligence, 2021, 51 : 4515 - 4524
  • [44] Semantic Relation Graph Reasoning Network for Visual Question Answering
    Lan, Hong
    Zhang, Pufen
    TWELFTH INTERNATIONAL CONFERENCE ON SIGNAL PROCESSING SYSTEMS, 2021, 11719
  • [45] AN AFFINITY-DRIVEN RELATION NETWORK FOR FIGURE QUESTION ANSWERING
    Zou, Jialong
    Wu, Guoli
    Xue, Taofeng
    Wu, Qingfeng
    2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020,
  • [46] A reasoning enhance network for muti-relation question answering
    Wu, Wenqing
    Zhu, Zhenfang
    Zhang, Guangyuan
    Kang, Shiyong
    Liu, Peiyu
    APPLIED INTELLIGENCE, 2021, 51 (07) : 4515 - 4524
  • [47] Video Graph Transformer for Video Question Answering
    Xiao, Junbin
    Zhou, Pan
    Chua, Tat-Seng
    Yan, Shuicheng
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 39 - 58
  • [48] Deep Graph Convolutional Network with Dual-Branch and Multi-interaction
    Lou J.
    Ye H.
    Yang B.
    Li M.
    Cao F.
    Moshi Shibie yu Rengong Zhineng/Pattern Recognition and Artificial Intelligence, 2022, 35 (08) : 754 - 763
  • [49] Video Reference: A Video Question Answering Engine
    Gao, Lei
    Li, Guangda
    Zheng, Yan-Tao
    Hong, Richang
    Chua, Tat-Seng
    ADVANCES IN MULTIMEDIA MODELING, PROCEEDINGS, 2010, 5916 : 799 - +
  • [50] Stepwise relation prediction with dynamic reasoning network for multi-hop knowledge graph question answering
    Cui, Hai
    Peng, Tao
    Bao, Tie
    Han, Ridong
    Han, Jiayu
    Liu, Lu
    APPLIED INTELLIGENCE, 2023, 53 (10) : 12340 - 12354