Dynamic self-attention with vision synchronization networks for video question answering

Cited by: 5
Authors
Liu, Yun [1 ]
Zhang, Xiaoming [2 ]
Huang, Feiran [3 ]
Shen, Shixun [1 ]
Tian, Peng [1 ]
Li, Lang [1 ]
Li, Zhoujun [4 ]
Affiliations
[1] Moutai Inst, Dept Automat, Renhuai 564507, Guizhou, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Jinan Univ, Coll Cyber Secur, Guangzhou 510632, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, Beijing 100191, Peoples R China
Keywords
Video question answering; Dynamic self-attention; Vision synchronization;
DOI
10.1016/j.patcog.2022.108959
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Question Answering (VideoQA) has gained increasing attention as an important task in understanding rich spatio-temporal content, i.e., the appearance and motion in a video. However, existing approaches mainly use the question to learn attention over all the sampled appearance and motion features separately, which neglects two properties of VideoQA: (1) the answer to the question is often reflected in only a few frames and video clips, and most video content is superfluous; (2) appearance and motion features are usually concomitant and complementary to each other in time series. In this paper, we propose a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), to address these problems. Specifically, a gated token selection mechanism is proposed to dynamically select the important tokens from appearance and motion sequences. These chosen tokens are fed into a self-attention mechanism to model the internal dependencies for more effective representation learning. To capture the correlation between the appearance and motion features, a vision synchronization block is proposed to synchronize the two types of vision features at the time-slice level. Then, the visual objects can be correlated with their corresponding activities and the performance is further improved. Extensive experiments conducted on three public VideoQA datasets confirm the effectiveness and superiority of our model compared with state-of-the-art methods. (c) 2022 Elsevier Ltd. All rights reserved.
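The pipeline described in the abstract (gate-scored token selection, self-attention over the kept tokens, then time-slice-level fusion of the appearance and motion streams) can be sketched as below. This is a minimal illustrative sketch with randomly initialized weights, not the authors' implementation: the gate (sigmoid scoring of each token), the single attention head, and the additive fusion step are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_token_selection(tokens, W_gate, k):
    """Score each token with a sigmoid gate and keep only the top-k,
    so superfluous frames/clips are dropped before attention."""
    scores = 1.0 / (1.0 + np.exp(-(tokens @ W_gate).squeeze(-1)))  # (T,)
    idx = np.sort(np.argsort(scores)[-k:])          # top-k, temporal order kept
    return tokens[idx] * scores[idx, None], idx     # gate modulates kept tokens

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the kept tokens."""
    q, k_mat, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k_mat.T / np.sqrt(q.shape[-1]))
    return attn @ v

T, d, k = 16, 8, 4  # sampled time slices, feature dim, tokens kept per stream
appearance = rng.standard_normal((T, d))   # e.g. frame-level CNN features
motion     = rng.standard_normal((T, d))   # e.g. clip-level motion features

W_gate = rng.standard_normal((d, 1))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

app_sel, app_idx = gated_token_selection(appearance, W_gate, k)
mot_sel, mot_idx = gated_token_selection(motion, W_gate, k)

app_ctx = self_attention(app_sel, W_q, W_k, W_v)
mot_ctx = self_attention(mot_sel, W_q, W_k, W_v)

# "Synchronization" stand-in: fuse the two streams position-by-position
# (the paper's vision synchronization block is more involved).
fused = np.tanh(app_ctx + mot_ctx)
print(fused.shape)  # (4, 8): k fused time slices of dimension d
```

Positional fusion here assumes the two streams keep matching slices; in the real model the synchronization block explicitly aligns appearance features with their concomitant motion features per time slice.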
Pages: 12
Related Papers (50 total)
  • [41] CopyBERT: A Unified Approach to Question Generation with Self-Attention
    Varanasi, Stalin
    Amin, Saadullah
    Neumann, Guenter
    NLP FOR CONVERSATIONAL AI, 2020, : 25 - 31
  • [42] Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks
    Jin, Weike
    Zhao, Zhou
    Gu, Mao
    Yu, Jun
    Xiao, Jun
    Zhuang, Yueting
    PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 465 - 474
  • [43] A self-attention based dynamic resource management for satellite-terrestrial networks
    Lin, Tianhao
    Luo, Zhiyong
    CHINA COMMUNICATIONS, 2024, 21 (04) : 136 - 150
  • [44] Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering
    Gong, Haifan
    Chen, Guanqi
    Liu, Sishuo
    Yu, Yizhou
    Li, Guanbin
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 456 - 460
  • [45] ATTENTIONLITE: TOWARDS EFFICIENT SELF-ATTENTION MODELS FOR VISION
    Kundu, Souvik
    Sundaresan, Sairam
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2225 - 2229
  • [46] Vision Transformer Based on Reconfigurable Gaussian Self-attention
    Zhao L.
    Zhou J.-K.
    Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (09): : 1976 - 1988
  • [48] Stand-Alone Self-Attention in Vision Models
    Ramachandran, Prajit
    Parmar, Niki
    Vaswani, Ashish
    Bello, Irwan
    Levskaya, Anselm
    Shlens, Jonathon
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [49] Long video question answering: A Matching-guided Attention Model
    Wang, Weining
    Huang, Yan
    Wang, Liang
    PATTERN RECOGNITION, 2020, 102
  • [50] SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Liu, Zichuan
    Liu, Fayao
    Tang, Zhenmin
    Yao, Yazhou
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9666 - 9675