CNN image caption generation

Cited by: 0
Authors
Li Y. [1 ,2 ,3 ]
Cheng H. [1 ,2 ,3 ]
Liang X. [1 ,2 ,3 ]
Guo Q. [1 ,2 ,3 ]
Qian Y. [1 ,2 ,3 ]
Affiliations
[1] Research Institute of Big Data Science and Industry, Shanxi University, Taiyuan
[2] Key Lab. of Computational Intelligence and Chinese Information Processing of Ministry of Education, Shanxi University, Taiyuan
[3] School of Computer and Information Technology, Shanxi University, Taiyuan
Keywords
Image caption; Long short-term memory; Multi-modal data; Neural networks
DOI
10.19665/j.issn1001-2400.2019.02.025
Abstract
The image caption generation task requires producing a meaningful sentence that accurately describes the content of an image. Existing research typically uses a convolutional neural network to encode image information and a recurrent neural network to encode text information; however, the serial nature of the recurrent neural network results in low efficiency. To solve this problem, we propose a model based entirely on convolutional neural networks, using different convolutional networks to process the data of the two modalities simultaneously. Benefiting from the parallel nature of the convolution operation, the efficiency of the model is significantly improved. Experiments carried out on two public data sets show improvements on the specified evaluation metrics, which indicates the effectiveness of the model for the image caption generation task. © 2019, The Editorial Board of Journal of Xidian University. All rights reserved.
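The record does not give the decoder's architectural details, but the abstract's key claim is that a convolutional text decoder, unlike a recurrent one, can process all sequence positions in parallel while each output still depends only on earlier tokens. A minimal NumPy sketch of that ingredient, a causal 1-D convolution, is shown below; the function name and shapes are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def causal_conv1d(x, w):
    """Causal 1-D convolution over a token sequence (illustrative sketch).

    x: (T, C_in) array of token embeddings.
    w: (k, C_in, C_out) convolution kernel of width k.
    Returns y: (T, C_out), where y[t] depends only on x[t-k+1 .. t].

    Unlike an RNN step, every position t is computed in the same pass,
    so the whole sequence is processed in parallel during training.
    """
    k, c_in, c_out = w.shape
    T = x.shape[0]
    # Left-pad with k-1 zero vectors so the kernel never sees the future.
    x_pad = np.vstack([np.zeros((k - 1, c_in)), x])
    y = np.zeros((T, c_out))
    for i in range(k):
        # Each kernel tap contributes x[t - (k-1) + i] @ w[i] to y[t].
        y += x_pad[i:i + T] @ w[i]
    return y
```

Causality can be checked directly: perturbing the embedding at position t changes the outputs at positions t and later, but leaves all earlier outputs untouched.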
Pages: 152-157
Number of pages: 5