Multimodal architecture for video captioning with memory networks and an attention mechanism

Cited by: 26

Authors:

Li, Wei [1]

Guo, Dashan [1]

Fang, Xiangzhong [1]

Affiliations:

[1] Shanghai Jiao Tong Univ, Dept Elect Engn, Shanghai 200240, Peoples R China

Source:

PATTERN RECOGNITION LETTERS | 2018 / Vol. 105

Keywords:

Video captioning; Memory network; Attention mechanism;

DOI:

10.1016/j.patrec.2017.10.012

Chinese Library Classification number:

TP18 [Artificial intelligence theory];

Discipline classification codes:

081104 ; 0812 ; 0835 ; 1405 ;

Abstract:

Automatically describing videos that contain rich, open-domain activities is a very challenging task for computer vision and machine learning research. Accurate descriptions of video content require an understanding of both visual concepts and their temporal dynamics. Much effort has been devoted to understanding visual concepts in still-image tasks, e.g., image classification and object detection. However, the combination of visual concepts and temporal dynamics has not received sufficient attention. To delve deeper into this unique characteristic of videos, we propose a novel video captioning architecture that integrates both visual concepts and temporal dynamics. In this paper, an attention mechanism and memory networks are combined in our multimodal framework with a feature selection algorithm. Specifically, we use a soft attention mechanism to select frames relevant to the visual concepts based on previously generated words, while the memorization of temporal dynamics is implemented by the memory networks, which are well suited to memorizing long-term information. The visual concepts and the temporal dynamics are then integrated in our multimodal architecture. Moreover, the feature selection algorithm is applied to select the more relevant features between the two according to the part of speech. Finally, we test our proposed framework on both the MSVD and MSR-VTT datasets and achieve competitive performance compared with other state-of-the-art methods. (c) 2017 Elsevier B.V. All rights reserved.
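As a rough illustration of the soft attention step mentioned in the abstract, the following minimal sketch weights per-frame features by their relevance to a decoder state that summarizes the previously generated words. The abstract does not give the scoring function, so the additive (tanh) scoring used here, the function name `soft_attention`, the parameters `W_f`, `W_h`, `w`, and all dimensions are assumptions for illustration only, not the authors' implementation.

```python
import numpy as np

def soft_attention(frame_feats, decoder_state, W_f, W_h, w):
    """Weight frame features by relevance to the current decoder state.

    frame_feats:   (T, D) array, one feature vector per frame
    decoder_state: (H,) array summarizing previously generated words
    W_f (D, A), W_h (H, A), w (A,): hypothetical learned parameters
    """
    # Additive scoring (an assumption): one unnormalized score per frame.
    scores = np.tanh(frame_feats @ W_f + decoder_state @ W_h) @ w  # (T,)
    # Softmax over frames, shifted by the max for numerical stability.
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    # Context vector: attention-weighted average of the frame features.
    context = weights @ frame_feats  # (D,)
    return context, weights

# Toy example with random values standing in for real features.
rng = np.random.default_rng(0)
T, D, H, A = 8, 16, 12, 10  # frames, feature dim, state dim, attention dim
frame_feats = rng.normal(size=(T, D))
decoder_state = rng.normal(size=H)
W_f = rng.normal(size=(D, A))
W_h = rng.normal(size=(H, A))
w = rng.normal(size=A)

context, weights = soft_attention(frame_feats, decoder_state, W_f, W_h, w)
```

Because the weights form a probability distribution over frames, the context vector stays differentiable with respect to every frame, which is what distinguishes soft attention from a hard selection of a single frame.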


Pages: 23-29

Page count: 7

50 items in total

[1] Learning Multimodal Attention LSTM Networks for Video Captioning
Xu, Jun
Yao, Ting
Zhang, Yongdong
Mei, Tao
[J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 537 - 545
[2] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
Sun, Liang
Li, Bing
Yuan, Chunfeng
Zha, Zhengjun
Hu, Weiming
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
[3] Multimodal attention-based transformer for video captioning
Hemalatha Munusamy
Chandra Sekhar C
[J]. Applied Intelligence, 2023, 53 : 23349 - 23368
[4] Multimodal attention-based transformer for video captioning
Munusamy, Hemalatha
Sekhar, C. Chandra
[J]. APPLIED INTELLIGENCE, 2023, 53 (20) : 23349 - 23368
[5] Hierarchical attention-based multimodal fusion for video captioning
Wu, Chunlei
Wei, Yiwei
Chu, Xiaoliang
Weichen, Sun
Su, Fei
Wang, Leiquan
[J]. NEUROCOMPUTING, 2018, 315 : 362 - 370
[6] Multimodal-enhanced hierarchical attention network for video captioning
Maosheng Zhong
Youde Chen
Hao Zhang
Hao Xiong
Zhixiang Wang
[J]. Multimedia Systems, 2023, 29 : 2469 - 2482
[7] Multimodal-enhanced hierarchical attention network for video captioning
Zhong, Maosheng
Chen, Youde
Zhang, Hao
Xiong, Hao
Wang, Zhixiang
[J]. MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2469 - 2482
[8] M3: Multimodal Memory Modelling for Video Captioning
Wang, Junbo
Wang, Wei
Huang, Yan
Wang, Liang
Tan, Tieniu
[J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 7512 - 7520
[9] Stacked Multimodal Attention Network for Context-Aware Video Captioning
Zheng, Yi
Zhang, Yuejie
Feng, Rui
Zhang, Tao
Fan, Weiguo
[J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42
[10] MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning
Zou, Cong
Wang, Xuchen
Hu, Yaosi
Chen, Zhenzhong
Liu, Shan
[J]. 2021 INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2021,
