Dynamic self-attention with vision synchronization networks for video question answering

Cited by: 5
Authors
Liu, Yun [1]
Zhang, Xiaoming [2]
Huang, Feiran [3]
Shen, Shixun [1]
Tian, Peng [1]
Li, Lang [1]
Li, Zhoujun [4]
Affiliations
[1] Moutai Inst, Dept Automat, Renhuai 564507, Guizhou, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Jinan Univ, Coll Cyber Secur, Guangzhou 510632, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, Beijing 100191, Peoples R China
Keywords
Video question answering; Dynamic self-attention; Vision synchronization
DOI
10.1016/j.patcog.2022.108959
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Video Question Answering (VideoQA) has gained increasing attention as an important task in understanding rich spatio-temporal content, i.e., the appearance and motion in a video. However, existing approaches mainly use the question to learn attention over all the sampled appearance and motion features separately, which neglects two properties of VideoQA: (1) the answer to the question is often reflected in only a few frames and video clips, and most of the video content is superfluous; (2) appearance and motion features are usually concomitant and complementary to each other along the time series. In this paper, we propose a novel VideoQA model, Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), to address these problems. Specifically, a gated token selection mechanism is proposed to dynamically select the important tokens from the appearance and motion sequences. The chosen tokens are fed into a self-attention mechanism that models their internal dependencies for more effective representation learning. To capture the correlation between the appearance and motion features, a vision synchronization block is proposed to synchronize the two types of vision features at the time-slice level. The visual objects can then be correlated with their corresponding activities, which further improves performance. Extensive experiments conducted on three public VideoQA data sets confirm the effectiveness and superiority of our model compared with state-of-the-art methods. (c) 2022 Elsevier Ltd. All rights reserved.
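The abstract does not give the equations behind the gated token selection or the self-attention step, so the following is only an illustrative sketch of the general idea, not the DSAVS implementation. It assumes a sigmoid gate scored by a dot product with a question vector, a fixed top-k selection, and single-head scaled dot-product attention with identity projections; the function names (`select_tokens`, `self_attention`) and the toy data are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def select_tokens(tokens, question, k):
    """Keep the k tokens with the highest gate score, in temporal order.

    Assumption: the gate is sigmoid(token . question); the paper's
    actual gating function is not stated in the abstract.
    """
    gates = [sigmoid(dot(t, question)) for t in tokens]
    top = sorted(range(len(tokens)), key=lambda i: gates[i], reverse=True)[:k]
    keep = sorted(top)  # restore the original time order
    # Scale each kept token by its gate value.
    return keep, [[gates[i] * x for x in tokens[i]] for i in keep]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity Q/K/V projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        weights = softmax([dot(q, key) / math.sqrt(d) for key in tokens])
        out.append([sum(w * v[j] for w, v in zip(weights, tokens))
                    for j in range(d)])
    return out

# Toy example: four 2-d "frame" features and a 2-d question vector.
frames = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.5], [-1.0, 0.0]]
question = [1.0, 0.0]
kept, selected = select_tokens(frames, question, k=2)
attended = self_attention(selected)
```

In this toy setting the two frames most aligned with the question vector are kept and only those are passed to attention, which mirrors the abstract's point that most video content is superfluous. The vision synchronization block is not sketched here because the abstract does not describe how it operates.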
Pages: 12
Related papers
  • [1] Stacked Self-Attention Networks for Visual Question Answering
    Sun, Qiang
    Fu, Yanwei
    ICMR'19: PROCEEDINGS OF THE 2019 ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL, 2019, : 207 - 211
  • [2] Dual self-attention with co-attention networks for visual question answering
    Liu, Yun
    Zhang, Xiaoming
    Zhang, Qianyun
    Li, Chaozhuo
    Huang, Feiran
    Tang, Xianghong
    Li, Zhoujun
    PATTERN RECOGNITION, 2021, 117 (117)
  • [3] Beyond RNNs: Positional Self-Attention with Co-Attention for Video Question Answering
    Li, Xiangpeng
    Song, Jingkuan
    Gao, Lianli
    Liu, Xianglong
    Huang, Wenbing
    He, Xiangnan
    Gan, Chuang
    THIRTY-THIRD AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE / THIRTY-FIRST INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE / NINTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2019, : 8658 - 8665
  • [4] Local self-attention in transformer for visual question answering
    Shen, Xiang
    Han, Dezhi
    Guo, Zihan
    Chen, Chongqing
    Hua, Jie
    Luo, Gaofeng
    APPLIED INTELLIGENCE, 2023, 53 (13) : 16706 - 16723
  • [5] Local self-attention in transformer for visual question answering
    Xiang Shen
    Dezhi Han
    Zihan Guo
    Chongqing Chen
    Jie Hua
    Gaofeng Luo
    Applied Intelligence, 2023, 53 : 16706 - 16723
  • [6] Open-Ended Long-Form Video Question Answering via Hierarchical Convolutional Self-Attention Networks
    Zhang, Zhu
    Zhao, Zhou
    Lin, Zhijie
    Song, Jingkuan
    He, Xiaofei
    PROCEEDINGS OF THE TWENTY-EIGHTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2019, : 4383 - 4389
  • [7] A novel self-attention enriching mechanism for biomedical question answering
    Kaddari, Zakaria
    Bouchentouf, Toumi
    EXPERT SYSTEMS WITH APPLICATIONS, 2023, 225
  • [8] Cascade transformers with dynamic attention for video question answering
    Jiang, Yimin
    Yan, Tingfei
    Yao, Mingze
    Wang, Huibing
    Liu, Wenzhe
    COMPUTER VISION AND IMAGE UNDERSTANDING, 2024, 242
  • [9] Video Question Answering by Frame Attention
    Fang, Jiannan
    Sun, Lingling
    Wang, Yaqi
    ELEVENTH INTERNATIONAL CONFERENCE ON DIGITAL IMAGE PROCESSING (ICDIP 2019), 2019, 11179
  • [10] Research on Question Answering System Based on Bi-LSTM and Self-attention Mechanism
    Xiang, Hao
    Gu, Jinguang
    2020 IEEE 7TH INTERNATIONAL CONFERENCE ON INDUSTRIAL ENGINEERING AND APPLICATIONS (ICIEA 2020), 2020, : 726 - 730