Combine Visual Features and Scene Semantics for Image Captioning

Cited by: 0

Authors:

Li Z.-X. ^{[1]}

Wei H.-Y. ^{[1]}

Huang F.-C. ^{[1]}

Zhang C.-L. ^{[1]}

Ma H.-F. ^{[1,2]}

Shi Z.-Z. ^{[3]}

Affiliations:

[1] Guangxi Key Lab of Multi-source Information Mining & Security, Guangxi Normal University, Guilin

[2] College of Computer Science and Engineering, Northwest Normal University, Lanzhou

[3] Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, Beijing

Source:

Li, Zhi-Xin (lizx@gxnu.edu.cn) | Science Press | Issue 43 | p. 1624

Funding:

National Natural Science Foundation of China

Keywords:

Attention mechanism; Encoder-decoder framework; Image captioning; Reinforcement learning; Scene semantics

DOI:

10.11897/SP.J.1016.2020.01624

Abstract:

Most existing image captioning methods use only the visual information of an image to guide caption generation and lack guidance from effective scene semantic information. In addition, current visual attention mechanisms cannot effectively adjust how strongly they focus on the image. To address these problems, this paper first proposes an improved visual attention model that introduces a focus intensity coefficient to adjust attention intensity automatically. Specifically, the focus intensity coefficient is a learnable scaling factor, computed from the image information and the context information of the model at each time step of language model decoding. When the attention mechanism computes the attention weight distribution over the image, the coefficient adaptively scales the input of the softmax function, which automatically adjusts how "soft" or "hard" the attention is and thereby concentrates or disperses the visual attention. The proposed attention model can therefore extract more accurate visual information from the image. Furthermore, we combine unsupervised and supervised learning to extract a series of topic words related to the image scene; these words represent the scene semantic information of the image and are added to the language model to guide caption generation. We believe that each image contains several scene topic concepts and that each topic concept can be represented by some topic words. Specifically, we use the latent Dirichlet allocation (LDA) model to cluster all the caption texts in the dataset, and the topic category of a caption text is then used to represent the scene category of the corresponding image. We also train a multi-layer perceptron (MLP) to classify images into topic concepts. 
As a result, each topic category is represented by a series of topic words obtained from clustering, and the scene semantic information of each image can be represented by these topic words, which are highly relevant to the image scene. We add these topic words to the language model so that it can obtain more prior knowledge. Because the topic information of the image scene is obtained by analyzing the captions, it contains some global information about the captions to be generated. Our model can therefore predict some important words that are suitable for the image scene. Finally, we use the attention mechanism to determine which visual information of the image and which semantic information of the scene the model attends to at each time step of decoding, and we use a gating mechanism to control the proportion in which these two kinds of information are input. Both kinds of information are then combined to guide the model to generate more accurate and scene-specific captions. In the experiments, we evaluate our model on two standard datasets, MSCOCO and Flickr30k. The results show that our approach generates more accurate captions than many state-of-the-art approaches. In addition, compared with the baseline approach, our approach achieves about a 3% improvement on the overall evaluation metrics. © 2020, Science Press. All rights reserved.
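
The focus intensity idea in the abstract, scaling the softmax input so that attention becomes more concentrated or more dispersed, can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the fixed numeric `focus` values, the region scores, and the assumption that the coefficient is positive are all hypothetical. In the paper the coefficient is learned and computed from the image and context information at each decoding step.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def focused_attention(scores, focus):
    """Attention weights with the softmax input scaled by `focus`.

    `focus` stands in for the focus intensity coefficient. Here it is a
    plain positive number (an assumption of this sketch); the abstract
    describes it as a learnable factor computed at every time step.
    """
    if focus <= 0:
        raise ValueError("focus must be positive in this sketch")
    return softmax([focus * s for s in scores])


def entropy(weights):
    """Shannon entropy; lower values mean more concentrated attention."""
    return -sum(w * math.log(w) for w in weights if w > 0)


# Made-up alignment scores for four image regions.
region_scores = [2.0, 1.0, 0.5, 0.1]

soft = focused_attention(region_scores, 0.2)  # small focus: dispersed ("soft")
hard = focused_attention(region_scores, 5.0)  # large focus: concentrated ("hard")
```

With the same region scores, the larger coefficient puts more weight on the top-scoring region and gives a lower-entropy distribution, which is the concentration and dispersion behavior the abstract describes.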

Citation

Pages: 1624-1640

Page count: 16
