Video Question Answering Scheme Based on Multimodal Knowledge Active Learning

Cited by: 0

Authors:

Liu M. [1]

Wang R. [1]

Zhou F. [1]

Lin G. [1]

Affiliation:

[1] National Engineering Research Center of Digital Life, School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou

Source:

Jisuanji Yanjiu yu Fazhan/Computer Research and Development | 2024, Vol. 61, No. 04

Keywords:

data fusion and reasoning; deep learning; multimodal active learning; video details description extraction; video question answering

DOI:

10.7544/issn1000-1239.202221008

Abstract:

Video question answering requires models to understand, fuse, and reason over the multimodal data in videos, helping people quickly retrieve, analyze, and summarize complex video scenes; it has become a hot research topic in artificial intelligence. However, existing methods fail to capture the motion details of visual objects during feature extraction, which may lead to false causal inferences. In addition, during data fusion and reasoning, existing methods lack an effective active learning ability, which makes it difficult to obtain prior knowledge beyond the extracted features and limits the model's deep understanding of multimodal content. To address these issues, we propose a video question answering solution based on multimodal knowledge active learning. The solution captures the semantic correlations among visual targets in image sequences and their dynamic relationships with the surrounding environment to establish the motion trajectory of each visual target. Static content is then supplemented with dynamic content to provide a more accurate video feature representation for data fusion and reasoning. Next, through a knowledge auto-enhancement multimodal data fusion and reasoning model, the solution achieves self-improvement and a focus on logical thinking in multimodal information understanding, filling the gap in deep understanding of multimodal content. Experimental results show that our scheme outperforms state-of-the-art video question answering algorithms, and extensive ablation and visualization experiments further verify the rationality of the solution. © 2024 Science Press. All rights reserved.

Citation

Pages: 889-902

Page count: 13

Total: 31 entries

[1] Jun Yu, Liang Wang, Zhou Yu, Research on visual question answering techniques[J], Journal of Computer Research and Development, 55, 9, (2018)
[2] Lu Zhang, Feng Cao, Xinyan Liang, et al., Cross-modal retrieval with correlation feature propagation[J], Journal of Computer Research and Development, 59, 9, (2022)
[3] Zhixin Li, Haiyang Wei, Canlong Zhang, et al., Research progress on image captioning[J], Journal of Computer Research and Development, 58, 9, pp. 1951-1974, (2021)
[4] Kaiming He, Xiangyu Zhang, Shaoqing Ren, et al., Deep residual learning for image recognition[C], Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition, pp. 770-778, (2016)
[5] Hara K, Kataoka H, Satoh Y., Can spatiotemporal 3D CNNs retrace the history of 2D CNNs and ImageNet?[C], Proc of the 36th IEEE Conf on Computer Vision and Pattern Recognition, pp. 6546-6555, (2018)
[6] Anderson P, Xiaodong He, Buehler C, et al., Bottom-up and top-down attention for image captioning and visual question answering[C], Proc of the 36th IEEE Conf on Computer Vision and Pattern Recognition, pp. 6077-6086, (2018)
[7] Jiasen Lu, Jianwei Yang, Batra D, et al., Hierarchical question-image co-attention for visual question answering[C], Proc of the 30th Int Conf on Neural Information Processing Systems, pp. 289-297, (2016)
[8] Jiyang Gao, Runzhou Ge, Kan Chen, et al., Motion-appearance co-memory networks for video question answering[C], Proc of the 36th IEEE Conf on Computer Vision and Pattern Recognition, pp. 6576-6585, (2018)
[9] Dang L H, Le T, Le V, et al., Hierarchical object-oriented spatiotemporal reasoning for video question answering[C], Proc of the 30th Int Joint Conf on Artificial Intelligence, pp. 636-642, (2021)
[10] Jianwen Jiang, Ziqiang Chen, Haojie Lin, et al., Divide and conquer: Question-guided spatio-temporal contextual attention for video question answering[C], Proc of the 34th AAAI Conf on Artificial Intelligence, pp. 11101-11108, (2020)
