Deep multimodal embedding for video captioning

Cited by: 9
|
Authors
Lee, Jin Young [1 ]
Institution
[1] Sejong Univ, Sch Intelligent Mechatron Engn, Seoul, South Korea
Keywords
Deep embedding; LSTM network; Multimodal features; Video captioning;
DOI
10.1007/s11042-019-08011-3
CLC Classification Number
TP [Automation Technology, Computer Technology];
Subject Classification Code
0812;
Abstract
Automatically generating natural language descriptions from videos, commonly called video captioning, is a very challenging task in computer vision. Thanks to the success of image captioning, there has been rapid progress in video captioning in recent years. Unlike images, videos carry a variety of modality information, such as frames, motion, audio, and so on. However, since each modality has different characteristics, how they are embedded in a multimodal video captioning network is very important. This paper proposes a deep multimodal embedding network based on an analysis of the multimodal features. The experimental results show that the captioning performance of the proposed network is very competitive in comparison with conventional networks.
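The abstract does not describe the network in implementation detail. As a purely illustrative sketch of the general idea it names (per-modality features projected into a shared embedding space before captioning), the snippet below embeds hypothetical frame, motion, and audio feature vectors and fuses them by concatenation; all dimensions, the fusion-by-concatenation choice, and every identifier are assumptions for illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(x, W, b):
    # Project one modality's feature vector into the shared embedding space.
    return np.tanh(W @ x + b)

# Hypothetical per-modality feature sizes and a common embedding size
# (illustrative values, not taken from the paper).
dims = {"frame": 2048, "motion": 1024, "audio": 128}
d_embed = 64

# Randomly initialized per-modality projection weights and biases.
params = {m: (rng.standard_normal((d_embed, d)) * 0.01, np.zeros(d_embed))
          for m, d in dims.items()}

# Stand-in modality features for a single video clip.
features = {m: rng.standard_normal(d) for m, d in dims.items()}

# Multimodal embedding: project each modality, then fuse by concatenation
# and map the fused vector to the caption decoder's input size.
per_modality = [embed(features[m], *params[m]) for m in dims]
fused = np.concatenate(per_modality)              # shape (3 * d_embed,)
W_fuse = rng.standard_normal((d_embed, fused.size)) * 0.01
decoder_input = np.tanh(W_fuse @ fused)           # shape (d_embed,)

print(decoder_input.shape)  # (64,)
```

In a full captioning pipeline, `decoder_input` would condition an LSTM decoder that emits the caption word by word; the point of the sketch is only that each modality gets its own learned projection before fusion.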
Pages: 31793 / 31805
Page count: 13
Related Papers
50 records
  • [31] Image and Video Captioning for Apparels Using Deep Learning
    Agarwal, Govind
    Jindal, Kritika
    Chowdhury, Abishi
    Singh, Vishal K.
    Pal, Amrit
    [J]. IEEE ACCESS, 2024, 12 : 113138 - 113150
  • [32] Stacked Multimodal Attention Network for Context-Aware Video Captioning
    Zheng, Yi
    Zhang, Yuejie
    Feng, Rui
    Zhang, Tao
    Fan, Weiguo
    [J]. IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, 2022, 32 (01) : 31 - 42
  • [33] MAPS: Joint Multimodal Attention and POS Sequence Generation for Video Captioning
    Zou, Cong
    Wang, Xuchen
    Hu, Yaosi
    Chen, Zhenzhong
    Liu, Shan
    [J]. 2021 INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2021,
  • [34] Deep Learning based, a New Model for Video Captioning
    Ozer, Elif Gusta
    Karapinar, Ilteber Nur
    Busbug, Sena
    Turan, Sumeyye
    Utku, Anil
    Akcayol, M. Ali
    [J]. INTERNATIONAL JOURNAL OF ADVANCED COMPUTER SCIENCE AND APPLICATIONS, 2020, 11 (03) : 514 - 519
  • [35] Deep Reinforcement Learning-based Image Captioning with Embedding Reward
    Ren, Zhou
    Wang, Xiaoyu
    Zhang, Ning
    Lv, Xutao
    Li, Li-Jia
    [J]. 30TH IEEE CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2017), 2017, : 1151 - 1159
  • [36] Adaptively Converting Auxiliary Attributes and Textual Embedding for Video Captioning Based on BiLSTM
    Chen, Shuqin
    Zhong, Xian
    Li, Lin
    Liu, Wenxuan
    Gu, Cheng
    Zhong, Luo
    [J]. NEURAL PROCESSING LETTERS, 2020, 52 (03) : 2353 - 2369
  • [38] Impact of Video Compression and Multimodal Embedding on Scene Description
    Lee, Jin Young
    [J]. ELECTRONICS, 2019, 8 (09)
  • [39] Semantic Embedding Guided Attention with Explicit Visual Feature Fusion for Video Captioning
    Dong, Shanshan
    Niu, Tianzi
    Luo, Xin
    Liu, Wu
    Xu, Xinshun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [40] MIVCN: Multimodal interaction video captioning network based on semantic association graph
    Wang, Ying
    Huang, Guoheng
    Lin, Yuming
    Yuan, Haoliang
    Pun, Chi-Man
    Ling, Wing-Kuen
    Cheng, Lianglun
    [J]. APPLIED INTELLIGENCE, 2022, 52 (05) : 5241 - 5260