Combine Visual Features and Scene Semantics for Image Captioning

Cited by: 0
|
Authors
Li Z.-X. [1 ]
Wei H.-Y. [1 ]
Huang F.-C. [1 ]
Zhang C.-L. [1 ]
Ma H.-F. [1 ,2 ]
Shi Z.-Z. [3 ]
Affiliations
[1] Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin
[2] College of Computer Science and Engineering, Northwest Normal University, Lanzhou
[3] Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing
Source
Li, Zhi-Xin (lizx@gxnu.edu.cn) | 2020 | Science Press | Vol. 43
Funding
National Natural Science Foundation of China;
Keywords
Attention mechanism; Encoder-decoder framework; Image captioning; Reinforcement learning; Scene semantics;
DOI
10.11897/SP.J.1016.2020.01624
Abstract
Most existing image captioning methods use only the visual information of the image to guide caption generation and lack the guidance of effective scene semantic information. In addition, the current visual attention mechanism cannot effectively adjust its focus intensity on the image. To solve these problems, this paper first proposes an improved visual attention model that introduces a focus intensity coefficient to adjust the attention intensity automatically. Specifically, the focus intensity coefficient of the attention mechanism is a learnable scaling factor. It is computed from the image information and the model's context information at each time step of the language-model decoding procedure. When the attention mechanism computes the attention weight distribution over the image, the "soft" or "hard" intensity of the attention can be adjusted automatically by adaptively scaling the input of the softmax function with the focus intensity coefficient, so that visual attention can be either concentrated or dispersed. The proposed attention model therefore extracts more accurate visual information from the image. Furthermore, we combine unsupervised and supervised learning to extract a series of topic words related to the image scene, which represent the scene semantic information of the image and are added to the language model to guide caption generation. We assume that each image contains several scene topic concepts and that each topic concept can be represented by a set of topic words. Specifically, we use the latent Dirichlet allocation (LDA) model to cluster all the caption texts in the dataset, and the topic category of a caption text is used to represent the scene category of the corresponding image. We then train a multi-layer perceptron (MLP) to classify images into these topic categories. As a result, each topic category is represented by a series of topic words obtained from clustering, and the scene semantic information of each image can be represented by topic words that are highly relevant to the image scene. We add these topic words to the language model so that it obtains more prior knowledge. Since the topic information of the image scene is obtained by analyzing the captions, it carries some global information about the captions to be generated; our model can therefore predict important words that suit the image scene. Finally, we use the attention mechanism to determine the image visual information and the scene semantic information that the model attends to at each time step of the decoding procedure, and use a gating mechanism to control the proportions in which these two kinds of information are input. The two kinds of information are then combined to guide the model to generate more accurate and scene-specific captions. In the experiments, we evaluate our model on two standard datasets, MSCOCO and Flickr30k. The results show that our approach generates more accurate captions than many state-of-the-art approaches and, compared with the baseline approach, achieves an improvement of about 3% on the overall evaluation metrics. © 2020, Science Press. All rights reserved.
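The focus intensity coefficient described in the abstract acts as a learnable scaling factor on the softmax input of the visual attention module. The sketch below illustrates the idea in PyTorch; the module name, the layer sizes, and the use of a softplus to keep the coefficient positive are illustrative assumptions, not the authors' released implementation.

```python
# Minimal sketch of focus-intensity-scaled visual attention (assumed design).
import torch
import torch.nn as nn
import torch.nn.functional as F

class FocusAttention(nn.Module):
    def __init__(self, feat_dim=2048, hid_dim=512, att_dim=512):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)
        self.hid_proj = nn.Linear(hid_dim, att_dim)
        self.score = nn.Linear(att_dim, 1)
        # The focus intensity coefficient is predicted from the mean image
        # feature and the decoder hidden state at each time step.
        self.focus = nn.Linear(feat_dim + hid_dim, 1)

    def forward(self, feats, h_t):
        # feats: (B, K, feat_dim) region features; h_t: (B, hid_dim) decoder state.
        e = self.score(torch.tanh(
            self.feat_proj(feats) + self.hid_proj(h_t).unsqueeze(1))).squeeze(-1)    # (B, K)
        # beta_t > 0: larger values sharpen ("harden") the attention distribution,
        # smaller values flatten ("soften") it.
        beta_t = F.softplus(self.focus(torch.cat([feats.mean(dim=1), h_t], dim=-1)))  # (B, 1)
        alpha = F.softmax(beta_t * e, dim=-1)                                          # (B, K)
        context = (alpha.unsqueeze(-1) * feats).sum(dim=1)                             # (B, feat_dim)
        return context, alpha, beta_t
```

With beta_t fixed at 1 this reduces to standard "soft" attention; letting the network predict beta_t per step is what allows the focus to concentrate or disperse as the abstract describes.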
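For the scene-semantics part, the abstract describes clustering the caption texts with LDA and training an MLP to map image features to the resulting topic categories. A minimal sketch using scikit-learn follows; the number of topics, the top-words count, and the classifier size are assumed hyperparameters, not values from the paper.

```python
# Sketch of the assumed scene-semantics pipeline: LDA over captions, then an
# MLP from CNN image features to topic categories.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.neural_network import MLPClassifier

def build_topics(captions, n_topics=80, n_top_words=10):
    # Bag-of-words over all caption texts in the dataset.
    vec = CountVectorizer(stop_words="english", max_features=10000)
    bow = vec.fit_transform(captions)
    lda = LatentDirichletAllocation(n_components=n_topics, random_state=0)
    doc_topic = lda.fit_transform(bow)                        # (N_captions, n_topics)
    vocab = np.array(vec.get_feature_names_out())
    # Top words of each topic serve as that scene category's topic words.
    top_words = [vocab[t.argsort()[::-1][:n_top_words]] for t in lda.components_]
    return doc_topic.argmax(axis=1), top_words                # topic id per caption, words per topic

def train_topic_classifier(image_feats, topic_ids):
    # image_feats: (N_images, D) CNN features; topic_ids: topic label per image.
    clf = MLPClassifier(hidden_layer_sizes=(1024,), max_iter=200)
    clf.fit(image_feats, topic_ids)
    return clf
```

At caption time, the top words of the topic predicted for an image would be embedded and fed to the language model as the scene semantic input described in the abstract.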
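Finally, the abstract mentions a gating mechanism that controls the proportion of visual and scene-semantic information entering the decoder at each step. A hedged sketch of one such gate is given below; the fusion form (a sigmoid gate over a concatenation) and all dimensions are assumptions.

```python
# Sketch of a gated fusion of visual and scene-semantic context (assumed form).
import torch
import torch.nn as nn

class GatedFusion(nn.Module):
    def __init__(self, vis_dim=2048, sem_dim=512, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, hid_dim)
        self.sem_proj = nn.Linear(sem_dim, hid_dim)
        self.gate = nn.Linear(3 * hid_dim, hid_dim)

    def forward(self, vis_ctx, sem_ctx, h_t):
        # vis_ctx: attended visual context; sem_ctx: attended topic-word context.
        v = self.vis_proj(vis_ctx)
        s = self.sem_proj(sem_ctx)
        # g in (0, 1) decides, per dimension, how much visual versus semantic
        # information is passed on to the language model at this time step.
        g = torch.sigmoid(self.gate(torch.cat([v, s, h_t], dim=-1)))
        return g * v + (1.0 - g) * s
```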
Pages: 1624-1640
Number of pages: 16