Salient Feature Extraction Mechanism for Image Captioning

Cited by: 0
Authors
Wang X. [1 ]
Song Y.-H. [2 ]
Zhang Y.-L. [2 ]
Affiliations
[1] School of Software Engineering, Xi'an Jiaotong University, Xi'an
[2] College of Artificial Intelligence, Xi'an Jiaotong University, Xi'an
Keywords
Decoder; Encoder; Image captioning; Language model; Salient feature extraction
DOI
10.16383/j.aas.c190279
Abstract
Image captioning is a research direction that combines computer vision and natural language processing. This paper designs a novel salient feature extraction mechanism (SFEM) to address several key problems in current methods: it quickly provides the most valuable visual features to the language model before each word is predicted, and it overcomes the inaccurate and time-consuming visual feature selection of existing methods. SFEM consists of a global salient feature extractor and an instant salient feature extractor. The global salient feature extractor extracts salient visual features from multiple local visual vectors and integrates them into a single global salient visual vector; the instant salient feature extractor then extracts, at each time step, the salient visual features required by the language model from this global salient visual vector. We evaluate SFEM on the MS COCO (Microsoft common objects in context) dataset. Experiments show that SFEM significantly improves the captioning accuracy of the baseline, and that it clearly outperforms the widely used spatial attention model in both caption accuracy and time performance. Copyright © 2022 Acta Automatica Sinica. All rights reserved.
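The abstract does not give implementation details, so the PyTorch sketch below is only an illustration of the two components it describes: a global salient feature extractor that pools many local visual vectors into one global salient vector (computed once per image), and an instant salient feature extractor that re-weights that vector at every decoding step according to the language model's hidden state. The layer shapes, the softmax pooling, and the sigmoid gating are assumptions for illustration, not the authors' actual design.

```python
import torch
import torch.nn as nn


class SFEMSketch(nn.Module):
    """Hypothetical sketch of a salient feature extraction mechanism."""

    def __init__(self, visual_dim: int, hidden_dim: int):
        super().__init__()
        # Global salient feature extractor: scores each local visual vector
        # and pools them into one global salient visual vector.
        self.global_score = nn.Linear(visual_dim, 1)
        # Instant salient feature extractor: gates the global vector with the
        # language model's hidden state at each decoding step.
        self.instant_gate = nn.Linear(visual_dim + hidden_dim, visual_dim)

    def global_salient(self, local_feats: torch.Tensor) -> torch.Tensor:
        # local_feats: (batch, num_regions, visual_dim), e.g. CNN region features
        weights = torch.softmax(self.global_score(local_feats), dim=1)
        return (weights * local_feats).sum(dim=1)  # (batch, visual_dim)

    def instant_salient(self, global_feat: torch.Tensor,
                        lm_state: torch.Tensor) -> torch.Tensor:
        # global_feat: (batch, visual_dim); lm_state: (batch, hidden_dim)
        gate = torch.sigmoid(
            self.instant_gate(torch.cat([global_feat, lm_state], dim=-1)))
        return gate * global_feat  # visual input for the current time step


# Minimal usage example (shapes are arbitrary placeholders).
sfem = SFEMSketch(visual_dim=2048, hidden_dim=512)
local = torch.randn(4, 49, 2048)                      # e.g. 7x7 grid of CNN features
g = sfem.global_salient(local)                        # computed once per image
v_t = sfem.instant_salient(g, torch.randn(4, 512))    # computed once per word
```

In this reading, the expensive pooling over all local vectors happens once per image, while the per-word step only gates a single vector, which is consistent with the abstract's claim of better time performance than per-step spatial attention.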
Pages: 735-746
Number of pages: 11