Multimodal Feature Learning for Video Captioning

Cited by: 6

Authors:

Lee, Sujin [1]

Kim, Incheol [1]

Affiliations:

[1] Kyonggi Univ, Dept Comp Sci, San 94-6, Suwon 443760, South Korea

Source:

MATHEMATICAL PROBLEMS IN ENGINEERING | Year 2018 / Volume 2018

Keywords:

DOI:

10.1155/2018/3125879

Chinese Library Classification (CLC) number:

T [Industrial Technology];

Subject classification code:

08 ;

Abstract:

Video captioning is the task of generating a natural language sentence that describes the content of an input video clip. This study proposes a deep neural network model for effective video captioning. In addition to visual features, the proposed model learns semantic features that describe the video content effectively. In our model, visual features of the input video are extracted using convolutional neural networks such as C3D and ResNet, while semantic features are obtained using recurrent neural networks such as LSTM. The model also includes an attention-based caption generation network that generates correct natural language captions from the multimodal video feature sequences. Experiments on two large benchmark datasets, Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT), demonstrate the performance of the proposed model.


Pages: 8
