Embedding VLAD in Transformer for Video Question Answering

Cited by: 0

Authors:

Guo D. [1,2,3,4]

Yao S.-T. [1]

Wang H. [1]

Wang M. [1,2,3,4]

Affiliations:

[1] School of Computer and Information Engineering, Hefei University of Technology, Hefei

[2] Institute of Artificial Intelligence, Hefei Comprehensive National Science Center, Hefei

[3] Key Laboratory of Knowledge Engineering with Big Data (Hefei University of Technology), Ministry of Education, Hefei

[4] Intelligent Interconnected Systems Laboratory of Anhui Province (Hefei University of Technology), Hefei

Source:

Jisuanji Xuebao/Chinese Journal of Computers | 2023, Vol. 46, No. 04

Keywords:

aggregated descriptors; deep learning; multi-modal data; transformer network; video question answering

DOI:

10.11897/SP.J.1016.2023.00671

Abstract:

Video question answering (VideoQA) is a typical cross-modal understanding task. Its challenge lies in learning appropriate multimodal representations and cross-modal correlations for answer inference. Most existing VideoQA methods focus on the latter, e.g., learning the relationship between each video frame or clip and each word. In this work, we focus instead on advanced feature embedding of both the video and the query, and develop a clustering-based VLAD technique for VideoQA. The novelty of our work is the joint exploitation of temporal aggregation and correlation across modalities. We propose an end-to-end trainable Transformed VLAD embedding network, named TVLAD-Net. TVLAD-Net constructs a differentiable aggregation module (i.e., a convolutional Residual-less VLAD Block) that generates compact VLAD descriptors (transforming N frames, clips, or words into K compact descriptors, where K < N), and uses multi-head attention to correlate the multimodal RVLAD descriptors. This design eliminates redundant and invalid clues in the feature sequence and ensures diversity through multiple learnable descriptors (corresponding to multiple clustering cells). Specifically, we first argue that a suitable representation should effectively expose the core semantic clues latent in sequence data. Following this principle, we focus on temporal aggregation within each modality to extract the core descriptors of the data. For both videos and questions, we develop a learnable clustering-based Residual VLAD encoder that summarizes each entire feature sequence into compact descriptors. Each descriptor can be regarded as a weighted aggregation over the entire feature sequence (a global view of a single modality). Using multiple descriptors amounts to viewing the global sequence several times, which provides rich perspectives for semantic summarization.
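The aggregation step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not give the layer details, so the sketch assumes a single linear soft-assignment (standing in for the convolutional block), omits residuals against cluster centers, and L2-normalizes each descriptor. The names `rvlad_aggregate`, `assign_w`, and `assign_b` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def rvlad_aggregate(features, assign_w, assign_b):
    """Summarize N feature vectors into K descriptors (assumed formulation).

    features: (N, D) sequence of frame, clip, or word features
    assign_w: (D, K) hypothetical learnable assignment weights
    assign_b: (K,)   hypothetical learnable assignment biases
    returns:  descriptors (K, D) and soft assignments (N, K)
    """
    logits = features @ assign_w + assign_b          # (N, K)
    assign = softmax(logits, axis=1)                 # each item spreads its weight over K cells
    descriptors = assign.T @ features                # (K, D): weighted sum over the whole sequence
    norms = np.linalg.norm(descriptors, axis=1, keepdims=True)
    return descriptors / np.maximum(norms, 1e-12), assign

rng = np.random.default_rng(0)
N, D, K = 20, 8, 4                                   # K < N, as in the abstract
features = rng.normal(size=(N, D))
assign_w = rng.normal(size=(D, K))
assign_b = np.zeros(K)
descriptors, assign = rvlad_aggregate(features, assign_w, assign_b)
print(descriptors.shape)
```

Each row of `descriptors` is a weighted aggregation over all N inputs, which is the sense in which every descriptor takes a global view of the sequence.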
In this work, we consider the summarization of visual frame features, clip features, combined frame and clip features of the video, and word features of the question. Second, we construct a unified Transformed module to realize multimodal descriptor interaction. To suppress irrelevant or redundant semantics in both visual and textual descriptors, we leverage multi-head attention from the Transformer architecture to control the information flow among these descriptors. The proposed transformed VLAD embedding module captures both inter-modality and intra-modality context correlations. Finally, an answer inference decoder is constructed for each specific question type. The questions in VideoQA can be divided into three types: 1) multi-choice, 2) open counting, and 3) open word. We use the corresponding decoder for each question type to infer the final answer. We evaluated TVLAD-Net on three VideoQA benchmark datasets: TGIF-QA, MSVD-QA, and MSRVTT-QA. The experimental results show that the proposed method achieves high answer inference accuracy, with an improvement of 2% to 5% over existing methods. The main contributions are as follows: 1) by introducing clustering-based VLAD aggregation into a differentiable convolutional network, we refine the original features and generate advanced multimodal descriptors for VideoQA; 2) the multi-head operation in the transformed VLAD embedding ensures context correlation at both the inter-modality and intra-modality levels, so that descriptors with similar or consistent semantics, whether visual or textual, are drawn together; 3) extensive experiments demonstrate the effectiveness of TVLAD-Net over other approaches on three benchmark datasets. © 2023 Science Press. All rights reserved.
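The multimodal descriptor interaction mentioned in the abstract can be sketched as standard multi-head self-attention over the concatenated visual and textual descriptors. This is an illustration under stated assumptions, not the paper's architecture: the abstract does not specify how the modalities are wired together, so the sketch simply concatenates them into one sequence, which lets every descriptor attend within its own modality and across modalities. All weights and the function name `multi_head_self_attention` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, w_q, w_k, w_v, w_o, num_heads):
    """Scaled dot-product self-attention with several heads.

    x: (M, D) descriptors; w_q, w_k, w_v, w_o: (D, D) projections.
    returns: updated descriptors (M, D) and attention maps (H, M, M)
    """
    M, D = x.shape
    d_h = D // num_heads

    def split(t):
        # (M, D) -> (H, M, d_h)
        return t.reshape(M, num_heads, d_h).transpose(1, 0, 2)

    q, k, v = split(x @ w_q), split(x @ w_k), split(x @ w_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_h)   # (H, M, M)
    attn = softmax(scores, axis=-1)
    heads = attn @ v                                    # (H, M, d_h)
    merged = heads.transpose(1, 0, 2).reshape(M, D)     # concatenate heads
    return merged @ w_o, attn

rng = np.random.default_rng(0)
D, H = 8, 2
visual = rng.normal(size=(4, D))     # e.g. 4 visual descriptors (assumed count)
textual = rng.normal(size=(3, D))    # e.g. 3 textual descriptors (assumed count)
tokens = np.concatenate([visual, textual], axis=0)      # (7, D)
w_q, w_k, w_v, w_o = (rng.normal(size=(D, D)) * 0.1 for _ in range(4))
out, attn = multi_head_self_attention(tokens, w_q, w_k, w_v, w_o, H)
print(out.shape, attn.shape)
```

In each attention map, the top-left 4×4 block holds visual-to-visual weights, the bottom-right 3×3 block holds textual-to-textual weights, and the off-diagonal blocks hold the inter-modality weights.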


Pages: 671-689

Page count: 18
