Multimodal Feature Learning for Video Captioning

Cited by: 6
Authors
Lee, Sujin [1 ]
Kim, Incheol [1 ]
Affiliation
[1] Kyonggi Univ, Dept Comp Sci, San 94-6, Suwon 443760, South Korea
DOI: 10.1155/2018/3125879
Chinese Library Classification: T [Industrial Technology]
Discipline code: 08
Abstract
Video captioning is the task of generating a natural language sentence that describes the content of an input video clip. This study proposes a deep neural network model for effective video captioning. Beyond visual features, the proposed model additionally learns semantic features that describe the video content effectively. In the model, visual features of the input video are extracted with convolutional neural networks such as C3D and ResNet, while semantic features are obtained with recurrent neural networks such as LSTM. The model also includes an attention-based caption generation network that produces natural language captions from the multimodal video feature sequences. Experiments on two large benchmark datasets, Microsoft Video Description (MSVD) and Microsoft Research Video-to-Text (MSR-VTT), demonstrate the performance of the proposed model.
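As a loose illustration of the attention step the abstract describes (a caption decoder attending over a multimodal sequence that concatenates visual and semantic features), here is a minimal NumPy sketch. The bilinear scoring function, dimensions, and variable names are assumptions for illustration, not the authors' exact formulation.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def attend(decoder_state, features, W):
    # score each time step against the decoder state (bilinear form; an assumption)
    scores = features @ W @ decoder_state        # shape (T,)
    weights = softmax(scores)                    # attention distribution over T steps
    context = weights @ features                 # weighted sum of multimodal features
    return context, weights

# hypothetical dimensions: T segments, d_feat per modality, d_hid decoder state
T, d_feat, d_hid = 8, 16, 16
rng = np.random.default_rng(0)
visual = rng.standard_normal((T, d_feat))    # stand-in for C3D/ResNet features
semantic = rng.standard_normal((T, d_feat))  # stand-in for LSTM semantic features
features = np.concatenate([visual, semantic], axis=1)  # multimodal sequence (T, 2*d_feat)
h = rng.standard_normal(d_hid)               # current decoder hidden state
W = rng.standard_normal((2 * d_feat, d_hid))
context, weights = attend(h, features, W)    # context feeds the next word prediction
```

At each decoding step the attention weights pick out the video segments most relevant to the word being generated, and the resulting context vector conditions the caption generator.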
Pages: 8
Related papers (50 total)
  • [21] From Deterministic to Generative: Multimodal Stochastic RNNs for Video Captioning
    Song, Jingkuan
    Guo, Yuyu
    Gao, Lianli
    Li, Xuelong
    Hanjalic, Alan
    Shen, Heng Tao
    [J]. IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2019, 30 (10) : 3047 - 3058
  • [22] Hierarchical attention-based multimodal fusion for video captioning
    Wu, Chunlei
    Wei, Yiwei
    Chu, Xiaoliang
    Sun, Weichen
    Su, Fei
    Wang, Leiquan
    [J]. NEUROCOMPUTING, 2018, 315 : 362 - 370
  • [23] Multimodal-enhanced hierarchical attention network for video captioning
    Zhong, Maosheng
    Chen, Youde
    Zhang, Hao
    Xiong, Hao
    Wang, Zhixiang
    [J]. MULTIMEDIA SYSTEMS, 2023, 29 (05) : 2469 - 2482
  • [25] End-to-end Generative Pretraining for Multimodal Video Captioning
    Seo, Paul Hongsuck
    Nagrani, Arsha
    Arnab, Anurag
    Schmid, Cordelia
    [J]. 2022 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR 2022), 2022, : 17938 - 17947
  • [26] Multimodal architecture for video captioning with memory networks and an attention mechanism
    Li, Wei
    Guo, Dashan
    Fang, Xiangzhong
    [J]. PATTERN RECOGNITION LETTERS, 2018, 105 : 23 - 29
  • [27] Multi-Task Video Captioning with a Stepwise Multimodal Encoder
    Liu, Zihao
    Wu, Xiaoyu
    Yu, Ying
    [J]. ELECTRONICS, 2022, 11 (17)
  • [28] Learning Video-Text Aligned Representations for Video Captioning
    Shi, Yaya
    Xu, Haiyang
    Yuan, Chunfeng
    Li, Bing
    Hu, Weiming
    Zha, Zheng-Jun
    [J]. ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2023, 19 (02)
  • [29] Video Captioning via Hierarchical Reinforcement Learning
    Wang, Xin
    Chen, Wenhu
    Wu, Jiawei
    Wang, Yuan-Fang
    Wang, William Yang
    [J]. 2018 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR), 2018, : 4213 - 4222
  • [30] Learning deep spatiotemporal features for video captioning
    Daskalakis, Eleftherios
    Tzelepi, Maria
    Tefas, Anastasios
    [J]. PATTERN RECOGNITION LETTERS, 2018, 116 : 143 - 149