Dynamic self-attention with vision synchronization networks for video question answering

Cited by: 5
Authors
Liu, Yun [1 ]
Zhang, Xiaoming [2 ]
Huang, Feiran [3 ]
Shen, Shixun [1 ]
Tian, Peng [1 ]
Li, Lang [1 ]
Li, Zhoujun [4 ]
Affiliations
[1] Moutai Inst, Dept Automat, Renhuai 564507, Guizhou, Peoples R China
[2] Beihang Univ, Sch Cyber Sci & Technol, Beijing 100191, Peoples R China
[3] Jinan Univ, Coll Cyber Secur, Guangzhou 510632, Peoples R China
[4] Beihang Univ, Sch Comp Sci & Engn, Beijing 100191, Peoples R China
Keywords
Video question answering; Dynamic self-attention; Vision synchronization;
DOI
10.1016/j.patcog.2022.108959
CLC Number
TP18 [Artificial Intelligence Theory];
Subject Classification Codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Video Question Answering (VideoQA) has gained increasing attention as an important task in understanding rich spatio-temporal content, i.e., the appearance and motion in a video. However, existing approaches mainly use the question to learn attention over all the sampled appearance and motion features separately, which neglects two properties of VideoQA: (1) the answer to the question is often reflected in only a few frames and video clips, and most video content is superfluous; (2) appearance and motion features are usually concomitant and complementary to each other in time series. In this paper, we propose a novel VideoQA model, i.e., Dynamic Self-Attention with Vision Synchronization Networks (DSAVS), to address these problems. Specifically, a gated token selection mechanism is proposed to dynamically select the important tokens from appearance and motion sequences. These chosen tokens are fed into a self-attention mechanism to model the internal dependencies for more effective representation learning. To capture the correlation between the appearance and motion features, a vision synchronization block is proposed to synchronize the two types of vision features at the time-slice level. Then, the visual objects can be correlated with their corresponding activities and the performance is further improved. Extensive experiments conducted on three public VideoQA datasets confirm the effectiveness and superiority of our model compared with state-of-the-art methods. (c) 2022 Elsevier Ltd. All rights reserved.
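The pipeline described in the abstract (gate-scored token selection, self-attention over the kept tokens, then time-slice-level fusion of the appearance and motion streams) can be sketched as below. This is a minimal illustrative sketch with randomly initialized weights, not the authors' implementation: the gate (sigmoid scoring of each token), the single attention head, and the additive fusion step are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gated_token_selection(tokens, W_gate, k):
    """Score each token with a sigmoid gate and keep only the top-k,
    so superfluous frames/clips are dropped before attention."""
    scores = 1.0 / (1.0 + np.exp(-(tokens @ W_gate).squeeze(-1)))  # (T,)
    idx = np.sort(np.argsort(scores)[-k:])          # top-k, temporal order kept
    return tokens[idx] * scores[idx, None], idx     # gate modulates kept tokens

def self_attention(x, W_q, W_k, W_v):
    """Single-head scaled dot-product self-attention over the kept tokens."""
    q, k_mat, v = x @ W_q, x @ W_k, x @ W_v
    attn = softmax(q @ k_mat.T / np.sqrt(q.shape[-1]))
    return attn @ v

T, d, k = 16, 8, 4  # sampled time slices, feature dim, tokens kept per stream
appearance = rng.standard_normal((T, d))   # e.g. frame-level CNN features
motion     = rng.standard_normal((T, d))   # e.g. clip-level motion features

W_gate = rng.standard_normal((d, 1))
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))

app_sel, app_idx = gated_token_selection(appearance, W_gate, k)
mot_sel, mot_idx = gated_token_selection(motion, W_gate, k)

app_ctx = self_attention(app_sel, W_q, W_k, W_v)
mot_ctx = self_attention(mot_sel, W_q, W_k, W_v)

# "Synchronization" stand-in: fuse the two streams position-by-position
# (the paper's vision synchronization block is more involved).
fused = np.tanh(app_ctx + mot_ctx)
print(fused.shape)  # (4, 8): k fused time slices of dimension d
```

Positional fusion here assumes the two streams keep matching slices; in the real model the synchronization block explicitly aligns appearance features with their concomitant motion features per time slice.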
Pages: 12
Related Papers (50 total)
  • [41] CopyBERT: A Unified Approach to Question Generation with Self-Attention
    Varanasi, Stalin
    Amin, Saadullah
    Neumann, Guenter
    NLP FOR CONVERSATIONAL AI, 2020, : 25 - 31
  • [42] Video Dialog via Multi-Grained Convolutional Self-Attention Context Networks
    Jin, Weike
    Zhao, Zhou
    Gu, Mao
    Yu, Jun
    Xiao, Jun
    Zhuang, Yueting
    PROCEEDINGS OF THE 42ND INTERNATIONAL ACM SIGIR CONFERENCE ON RESEARCH AND DEVELOPMENT IN INFORMATION RETRIEVAL (SIGIR '19), 2019, : 465 - 474
  • [43] A self-attention based dynamic resource management for satellite-terrestrial networks
    Lin, Tianhao
    Luo, Zhiyong
    CHINA COMMUNICATIONS, 2024, 21 (04) : 136 - 150
  • [44] Cross-Modal Self-Attention with Multi-Task Pre-Training for Medical Visual Question Answering
    Gong, Haifan
    Chen, Guanqi
    Liu, Sishuo
    Yu, Yizhou
    Li, Guanbin
    PROCEEDINGS OF THE 2021 INTERNATIONAL CONFERENCE ON MULTIMEDIA RETRIEVAL (ICMR '21), 2021, : 456 - 460
  • [45] ATTENTIONLITE: TOWARDS EFFICIENT SELF-ATTENTION MODELS FOR VISION
    Kundu, Souvik
    Sundaresan, Sairam
    2021 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP 2021), 2021, : 2225 - 2229
  • [46] Vision Transformer Based on Reconfigurable Gaussian Self-attention
    Zhao L.
    Zhou J.-K.
    Zidonghua Xuebao/Acta Automatica Sinica, 2023, 49 (09): : 1976 - 1988
  • [48] Stand-Alone Self-Attention in Vision Models
    Ramachandran, Prajit
    Parmar, Niki
    Vaswani, Ashish
    Bello, Irwan
    Levskaya, Anselm
    Shlens, Jonathon
    ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 32 (NIPS 2019), 2019, 32
  • [49] Long video question answering: A Matching-guided Attention Model
    Wang, Weining
    Huang, Yan
    Wang, Liang
    PATTERN RECOGNITION, 2020, 102
  • [50] SegEQA: Video Segmentation Based Visual Attention for Embodied Question Answering
    Luo, Haonan
    Lin, Guosheng
    Liu, Zichuan
    Liu, Fayao
    Tang, Zhenmin
    Yao, Yazhou
    2019 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV 2019), 2019, : 9666 - 9675