Multimodal Feature Learning for Video Captioning

Cited by: 6

Authors:

Lee, Sujin [1]

Kim, Incheol [1]

Affiliation:

[1] Kyonggi Univ, Dept Comp Sci, San 94-6, Suwon 443760, South Korea

Source:

MATHEMATICAL PROBLEMS IN ENGINEERING | 2018 / Vol. 2018

Keywords:

DOI:

10.1155/2018/3125879

Chinese Library Classification number:

T [Industrial Technology];

Discipline classification code:

08;

Abstract:

Video captioning is the task of generating a natural language sentence that describes the content of an input video clip. This study proposes a deep neural network model for effective video captioning. In addition to visual features, the proposed model learns semantic features that describe the video content effectively. In our model, visual features of the input video are extracted using convolutional neural networks such as C3D and ResNet, while semantic features are obtained using recurrent neural networks such as LSTM. The model also includes an attention-based caption generation network that generates natural language captions from the multimodal video feature sequences. Various experiments, conducted on two large benchmark datasets, Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT), demonstrate the performance of the proposed model.
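
The abstract does not specify how the attention over multimodal feature sequences is computed. As a rough illustration only, the sketch below fuses per-step visual and semantic vectors by concatenation and applies dot-product soft attention. The function names (`fuse`, `attend`), the concatenation-based fusion, the dot-product scoring, and the toy numbers are all assumptions for illustration, not the paper's method; in the actual model the features would come from C3D/ResNet and LSTM networks, and the query would correspond to a decoder state.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse(visual, semantic):
    """Concatenate per-step visual and semantic vectors (assumed fusion scheme)."""
    return [v + s for v, s in zip(visual, semantic)]


def attend(features, query):
    """Dot-product soft attention (assumed scoring function).

    Returns the attention weights over time steps and the weighted
    average of the feature vectors (the context vector).
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in features]
    weights = softmax(scores)
    dim = len(features[0])
    context = [
        sum(w * feat[i] for w, feat in zip(weights, features))
        for i in range(dim)
    ]
    return weights, context


# Toy stand-ins for extracted features: 3 time steps,
# 2-dimensional visual vectors and 1-dimensional semantic vectors.
visual = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
semantic = [[0.5], [0.2], [0.9]]

features = fuse(visual, semantic)   # 3 steps x 3 dims
query = [1.0, 1.0, 1.0]             # hypothetical decoder state
weights, context = attend(features, query)
```

A caption decoder would recompute `weights` and `context` at every output word, so that each word can draw on different time steps of the video.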

Pages: 8

50 records in total

[1] Multimodal feature fusion based on object relation for video captioning
Yan, Zhiwen
Chen, Ying
Song, Jinlong
Zhu, Jia
[J]. CAAI TRANSACTIONS ON INTELLIGENCE TECHNOLOGY, 2023, 8 (01) : 247 - 259
[2] Concept Parser With Multimodal Graph Learning for Video Captioning
Wu, Bofeng
Liu, Buyu
Huang, Peng
Bao, Jun
Peng, Xi
Yu, Jun
[J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2023, 33 (09) : 4484 - 4495
[3] Learning Multimodal Attention LSTM Networks for Video Captioning
Xu, Jun
Yao, Ting
Zhang, Yongdong
Mei, Tao
[J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 537 - 545
[4] Multirate Multimodal Video Captioning
Yang, Ziwei
Xu, Youjiang
Wang, Huiyun
Wang, Bo
Han, Yahong
[J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1877 - 1882
[5] Deep multimodal embedding for video captioning
Jin Young Lee
[J]. Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
[6] Multimodal Pretraining for Dense Video Captioning
Huang, Gabriel
Pang, Bo
Zhu, Zhenhai
Rivera, Clara
Soricut, Radu
[J]. 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
[7] Deep multimodal embedding for video captioning
Lee, Jin Young
[J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2019, 78 (22) : 31793 - 31805
[8] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
Sun, Liang
Li, Bing
Yuan, Chunfeng
Zha, Zhengjun
Hu, Weiming
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
[9] Video Captioning with Guidance of Multimodal Latent Topics
Chen, Shizhe
Chen, Jia
Jin, Qin
Hauptmann, Alexander
[J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1838 - 1846
[10] Temporal Attention Feature Encoding for Video Captioning
Kim, Nayoung
Ha, Seong Jong
Kang, Je-Won
[J]. 2020 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2020, : 1279 - 1282