Overview of Image Captions Based on Deep Learning

Cited by: 0
Authors
Shi Y.-L. [1 ]
Yang W.-Z. [2 ]
Du H.-X. [1 ]
Wang L.-H. [1 ]
Wang T. [1 ]
Li S.-S. [1 ]
Affiliations
[1] Key Laboratory of Software Engineering Technology, Xinjiang University, Urumqi
[2] School of Information Science and Engineering, Xinjiang University, Urumqi
Keywords
Attention mechanism; Encoder-decoder framework; Intelligence-image understanding; Reinforcement learning;
DOI: 10.12263/DZXB.20200669
Abstract
Image captioning aims to extract features from an image and feed them into a language-generation model that outputs a description of the image; it lies at the intersection of natural language processing and computer vision within artificial intelligence, namely image understanding. This survey summarizes and analyzes representative image-captioning papers published from 2015 to 2020. Taking the core technique as the classification criterion, the work can be roughly divided into five categories: image captioning based on the encoder-decoder framework, on the attention mechanism, on reinforcement learning, on generative adversarial networks, and on newly fused datasets. Experiments with three models (NIC, Hard-Attention, and NeuralTalk) are conducted on the real-world MS-COCO dataset, and their average BLEU1, BLEU2, BLEU3, and BLEU4 scores are compared to illustrate the effectiveness of the three models. The article also points out future development trends of image captioning, the challenges it will face, and research directions worth exploring. © 2021, Chinese Institute of Electronics. All rights reserved.
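The BLEU metrics used in the comparison above score a candidate caption by its clipped n-gram overlap with a reference caption. A minimal sketch of sentence-level BLEU1 through BLEU4 (single reference, no smoothing; the function name and example sentences are illustrative, not taken from the paper):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_scores(candidate, reference, max_n=4):
    """Cumulative sentence-level BLEU-1..BLEU-max_n against a single
    reference: clipped n-gram precisions combined by geometric mean,
    times a brevity penalty for short candidates."""
    precisions = []
    for n in range(1, max_n + 1):
        cand = Counter(ngrams(candidate, n))
        ref = Counter(ngrams(reference, n))
        clipped = sum(min(c, ref[g]) for g, c in cand.items())
        precisions.append(clipped / max(sum(cand.values()), 1))
    # Brevity penalty: only penalize candidates shorter than the reference.
    bp = math.exp(min(0.0, 1.0 - len(reference) / len(candidate)))
    scores = []
    for n in range(1, max_n + 1):
        if min(precisions[:n]) == 0.0:
            scores.append(0.0)  # geometric mean collapses if any precision is 0
        else:
            gmean = math.exp(sum(math.log(p) for p in precisions[:n]) / n)
            scores.append(bp * gmean)
    return scores  # [BLEU-1, BLEU-2, ..., BLEU-max_n]

cand = "a dog runs on the grass".split()
ref = "a dog runs on grass".split()
b1, b2, b3, b4 = bleu_scores(cand, ref)
```

Higher-order scores drop quickly when longer n-grams mismatch, which is why captioning surveys typically report BLEU1 through BLEU4 together.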
Pages: 2048-2060 (12 pages)
References: 64