Multimodal Deep Neural Network with Image Sequence Features for Video Captioning

Cited: 0
Authors
Oura, Soichiro [1]
Matsukawa, Tetsu [2]
Suzuki, Einoshin [2]
Affiliations
[1] Kyushu Univ, Grad Sch Syst Life Sci, Fukuoka, Japan
[2] Kyushu Univ, Dept Informat, ISEE, Fukuoka, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
LSTM; Multimodal RNN; Video captioning; Image captioning;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve this problem and has demonstrated a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is natural given the challenging nature of learning relationships between visual and textual content. A possible reason is that the available video caption data are still too small for this purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions owing to its capability of learning alignments between text fragments and image fragments. Experiments on two video caption datasets, MSVD and MSR-VTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, on MSVD, MDNNiSF achieved a METEOR score of 0.344, which is 21.5% higher than that of S2VT.
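The abstract describes an encoder-decoder built from a stack of two LSTMs (as in S2VT) combined with features from an image captioning model (NeuralTalk2). The PyTorch sketch below illustrates one plausible way such a stacked-LSTM captioner could fuse per-frame CNN features with image-caption features; the class name, layer sizes, and fusion scheme are illustrative assumptions and not the authors' published MDNNiSF architecture.

```python
# Minimal sketch of an S2VT-style stacked-LSTM video captioner in PyTorch.
# All names, layer sizes, and the fusion of per-frame image-caption features
# are illustrative assumptions, not the exact MDNNiSF model from the paper.
import torch
import torch.nn as nn


class StackedLSTMCaptioner(nn.Module):
    def __init__(self, frame_dim=4096, img_feat_dim=512, hidden=512, vocab=10000):
        super().__init__()
        # Fuse CNN frame features with per-frame image-captioning features
        # (the "image sequence features" suggested by the paper title).
        self.fuse = nn.Linear(frame_dim + img_feat_dim, hidden)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)      # visual LSTM
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)  # language LSTM
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frames, img_feats, captions):
        # frames:    (B, T, frame_dim)    CNN features per video frame
        # img_feats: (B, T, img_feat_dim) features from an image-captioning model
        # captions:  (B, L)               token ids of the target sentence
        v = self.fuse(torch.cat([frames, img_feats], dim=-1))
        h1, _ = self.lstm1(v)                                  # encode frame sequence
        ctx = h1[:, -1:, :].expand(-1, captions.size(1), -1)   # last visual state
        w = self.embed(captions)
        h2, _ = self.lstm2(torch.cat([ctx, w], dim=-1))
        return self.out(h2)                                    # (B, L, vocab) logits


# Usage: random tensors stand in for real MSVD/MSR-VTT features.
model = StackedLSTMCaptioner()
logits = model(torch.randn(2, 30, 4096), torch.randn(2, 30, 512),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```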
Pages: 7