Deep multimodal embedding for video captioning

Cited by: 9

Authors:

Lee, Jin Young [1]

Affiliations:

[1] Sejong Univ, Sch Intelligent Mechatron Engn, Seoul, South Korea

Source:

MULTIMEDIA TOOLS AND APPLICATIONS | 2019 / Vol. 78 / Issue 22

Keywords:

Deep embedding; LSTM network; Multimodal features; Video captioning;

DOI:

10.1007/s11042-019-08011-3

Chinese Library Classification (CLC) number:

TP [Automation technology, computer technology];

Subject classification code:

0812 ;

Abstract:

Automatically generating natural language descriptions from videos, known simply as video captioning, is a very challenging task in computer vision. Thanks to the success of image captioning, video captioning has progressed rapidly in recent years. Unlike images, videos carry information in a variety of modalities, such as frames, motion, and audio. However, since each modality has different characteristics, how the modalities are embedded in a multimodal video captioning network is very important. This paper proposes a deep multimodal embedding network based on an analysis of the multimodal features. The experimental results show that the captioning performance of the proposed network is very competitive with that of conventional networks.


Pages: 31793-31805

Page count: 13

50 items in total

[1] Deep multimodal embedding for video captioning
Lee, Jin Young
[J]. Multimedia Tools and Applications, 2019, 78 : 31793 - 31805
[2] Multirate Multimodal Video Captioning
Yang, Ziwei
Xu, Youjiang
Wang, Huiyun
Wang, Bo
Han, Yahong
[J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1877 - 1882
[3] Multimodal Deep Neural Network with Image Sequence Features for Video Captioning
Oura, Soichiro
Matsukawa, Tetsu
Suzuki, Einoshin
[J]. 2018 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2018,
[4] Multimodal Feature Learning for Video Captioning
Lee, Sujin
Kim, Incheol
[J]. MATHEMATICAL PROBLEMS IN ENGINEERING, 2018, 2018
[5] Multimodal Pretraining for Dense Video Captioning
Huang, Gabriel
Pang, Bo
Zhu, Zhenhai
Rivera, Clara
Soricut, Radu
[J]. 1ST CONFERENCE OF THE ASIA-PACIFIC CHAPTER OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS AND THE 10TH INTERNATIONAL JOINT CONFERENCE ON NATURAL LANGUAGE PROCESSING (AACL-IJCNLP 2020), 2020, : 470 - 490
[6] Early Embedding and Late Reranking for Video Captioning
Dong, Jianfeng
Li, Xirong
Lan, Weiyu
Huo, Yujia
Snoek, Cees G. M.
[J]. MM'16: PROCEEDINGS OF THE 2016 ACM MULTIMEDIA CONFERENCE, 2016, : 1082 - 1086
[7] MULTIMODAL SEMANTIC ATTENTION NETWORK FOR VIDEO CAPTIONING
Sun, Liang
Li, Bing
Yuan, Chunfeng
Zha, Zhengjun
Hu, Weiming
[J]. 2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2019, : 1300 - 1305
[8] Video Captioning with Guidance of Multimodal Latent Topics
Chen, Shizhe
Chen, Jia
Jin, Qin
Hauptmann, Alexander
[J]. PROCEEDINGS OF THE 2017 ACM MULTIMEDIA CONFERENCE (MM'17), 2017, : 1838 - 1846
[9] Position embedding fusion on transformer for dense video captioning
Yang, Sixuan
Tang, Pengjie
Wang, Hanli
Li, Qinyu
[J]. DEVELOPMENTS OF ARTIFICIAL INTELLIGENCE TECHNOLOGIES IN COMPUTATION AND ROBOTICS, 2020, 12 : 792 - 799
[10] A Deep Structured Model for Video Captioning
Vinodhini, V.
Sathiyabhama, B.
Sankar, S.
Somula, Ramasubbareddy
[J]. INTERNATIONAL JOURNAL OF GAMING AND COMPUTER-MEDIATED SIMULATIONS, 2020, 12 (02) : 44 - 56
