Multimodal Deep Neural Network with Image Sequence Features for Video Captioning

Cited by: 0
Authors
Oura, Soichiro [1]
Matsukawa, Tetsu [2]
Suzuki, Einoshin [2]
Affiliations
[1] Kyushu Univ, Grad Sch Syst Life Sci, Fukuoka, Japan
[2] Kyushu Univ, Dept Informat, ISEE, Fukuoka, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
LSTM; Multimodal RNN; Video captioning; Image captioning;
DOI
Not available
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve this problem and has demonstrated a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is natural given how challenging it is to learn relationships between visual and textual content. A possible reason is that the available video caption data are still too small for this purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions owing to its capability of learning alignments between text fragments and image fragments. Experiments on two video caption datasets, MSVD and MSR-VTT, demonstrate the effectiveness of MDNNiSF over S2VT. For example, on MSVD, MDNNiSF achieved a METEOR score of 0.344, which is 21.5% higher than that of S2VT.
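The abstract reports only the MDNNiSF score and the percentage gain, not the S2VT score itself. As a rough sketch, assuming the 21.5% figure is a relative improvement over S2VT (the abstract does not say so explicitly), the implied S2VT METEOR on MSVD can be back-computed. The function name `implied_baseline` is hypothetical and used only for illustration.

```python
def implied_baseline(score: float, relative_gain_percent: float) -> float:
    """Return the baseline b such that score == b * (1 + gain / 100)."""
    return score / (1.0 + relative_gain_percent / 100.0)


# Values quoted in the abstract (MSVD).
mdnnisf_meteor = 0.344
# ASSUMPTION: the 21.5% is a relative gain over S2VT, not an absolute difference.
relative_gain_percent = 21.5

s2vt_meteor = implied_baseline(mdnnisf_meteor, relative_gain_percent)
print(f"Implied S2VT METEOR on MSVD: {s2vt_meteor:.3f}")
```

Under that assumption the implied S2VT score is about 0.283; the paper itself remains the authoritative source for the actual baseline value.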
Pages: 7