Multimodal Deep Neural Network with Image Sequence Features for Video Captioning

Cited: 0
Authors
Oura, Soichiro [1]
Matsukawa, Tetsu [2]
Suzuki, Einoshin [2]
Affiliations
[1] Kyushu Univ, Grad Sch Syst Life Sci, Fukuoka, Japan
[2] Kyushu Univ, Dept Informat, ISEE, Fukuoka, Japan
Funding
Japan Society for the Promotion of Science;
Keywords
LSTM; Multimodal RNN; Video captioning; Image captioning;
DOI
Not available
CLC Number
TP18 [Artificial Intelligence Theory];
Discipline Codes
081104; 0812; 0835; 1405;
Abstract
In this paper, we propose MDNNiSF (Multimodal Deep Neural Network with image Sequence Features) for generating a sentence description of a given video clip. A recently proposed model, S2VT, uses a stack of two LSTMs to solve this problem and has demonstrated a high METEOR score. However, experiments show that S2VT sometimes produces inaccurate sentences, which is natural given the challenging nature of learning relationships between visual and textual content. A possible reason is that the available video caption data are still too small for this purpose. We try to circumvent this flaw by integrating S2VT with NeuralTalk2, an image captioning model known to generate accurate descriptions owing to its capability of learning alignments between text fragments and image fragments. Experiments on two video caption datasets, MSVD and MSR-VTT, demonstrate the effectiveness of our MDNNiSF over S2VT. For example, on MSVD, MDNNiSF achieved a METEOR score of 0.344, which is 21.5% higher than that of S2VT.
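The abstract describes an encoder-decoder built from a stack of two LSTMs (as in S2VT) combined with features from an image captioning model (NeuralTalk2). The PyTorch sketch below illustrates one plausible way such a stacked-LSTM captioner could fuse per-frame CNN features with image-caption features; the class name, layer sizes, and fusion scheme are illustrative assumptions and not the authors' published MDNNiSF architecture.

```python
# Minimal sketch of an S2VT-style stacked-LSTM video captioner in PyTorch.
# All names, layer sizes, and the fusion of per-frame image-caption features
# are illustrative assumptions, not the exact MDNNiSF model from the paper.
import torch
import torch.nn as nn


class StackedLSTMCaptioner(nn.Module):
    def __init__(self, frame_dim=4096, img_feat_dim=512, hidden=512, vocab=10000):
        super().__init__()
        # Fuse CNN frame features with per-frame image-captioning features
        # (the "image sequence features" suggested by the paper title).
        self.fuse = nn.Linear(frame_dim + img_feat_dim, hidden)
        self.lstm1 = nn.LSTM(hidden, hidden, batch_first=True)      # visual LSTM
        self.embed = nn.Embedding(vocab, hidden)
        self.lstm2 = nn.LSTM(2 * hidden, hidden, batch_first=True)  # language LSTM
        self.out = nn.Linear(hidden, vocab)

    def forward(self, frames, img_feats, captions):
        # frames:    (B, T, frame_dim)    CNN features per video frame
        # img_feats: (B, T, img_feat_dim) features from an image-captioning model
        # captions:  (B, L)               token ids of the target sentence
        v = self.fuse(torch.cat([frames, img_feats], dim=-1))
        h1, _ = self.lstm1(v)                                  # encode frame sequence
        ctx = h1[:, -1:, :].expand(-1, captions.size(1), -1)   # last visual state
        w = self.embed(captions)
        h2, _ = self.lstm2(torch.cat([ctx, w], dim=-1))
        return self.out(h2)                                    # (B, L, vocab) logits


# Usage: random tensors stand in for real MSVD/MSR-VTT features.
model = StackedLSTMCaptioner()
logits = model(torch.randn(2, 30, 4096), torch.randn(2, 30, 512),
               torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```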
Pages: 7