Overview of Image Captions Based on Deep Learning

Cited by: 0
Authors
Shi Y.-L. [1 ]
Yang W.-Z. [2 ]
Du H.-X. [1 ]
Wang L.-H. [1 ]
Wang T. [1 ]
Li S.-S. [1 ]
Affiliations
[1] Key Laboratory of Software Engineering Technology, Xinjiang University, Urumqi
[2] School of Information Science and Engineering, Xinjiang University, Urumqi
Keywords
Attention mechanism; Encoder-decoder framework; Intelligence-image understanding; Reinforcement learning;
DOI
10.12263/DZXB.20200669
Abstract
Image captioning aims to extract the features of an image and feed them into a language generation model that outputs a natural-language description of that image; it addresses image understanding, a problem at the intersection of natural language processing and computer vision in artificial intelligence. This article summarizes and analyzes representative work on image captioning published from 2015 to 2020. Taking the core technique as the classification criterion, existing methods can be roughly divided into five categories: image captioning based on the encoder-decoder framework, on the attention mechanism, on reinforcement learning, on generative adversarial networks, and on newly fused data sets. Experiments are conducted with three models, NIC, Hard-Attention, and NeuralTalk, on the real-world MS-COCO data set, and their average BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores are compared to illustrate the performance of the three models. Finally, the article points out future development trends of image captioning, the challenges it will face, and research directions worth exploring. © 2021, Chinese Institute of Electronics. All rights reserved.
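The abstract compares the three models by their average BLEU-1 through BLEU-4 scores on MS-COCO. As an illustration only (this is not the authors' evaluation code), the minimal sketch below shows how such cumulative BLEU scores could be computed with NLTK's corpus_bleu; the tokenized captions are made-up placeholders standing in for model outputs and MS-COCO ground-truth references.

```python
# Minimal sketch: cumulative BLEU-1..BLEU-4 for generated captions vs. references.
# Assumes captions are already tokenized; example sentences below are placeholders.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

def bleu_scores(references, hypotheses):
    """references: one list of reference token lists per image (several refs allowed);
    hypotheses: one candidate token list per image."""
    smooth = SmoothingFunction().method1  # avoids zero scores on short captions
    weights = {
        "BLEU1": (1.0, 0, 0, 0),
        "BLEU2": (0.5, 0.5, 0, 0),
        "BLEU3": (1/3, 1/3, 1/3, 0),
        "BLEU4": (0.25, 0.25, 0.25, 0.25),
    }
    return {name: corpus_bleu(references, hypotheses,
                              weights=w, smoothing_function=smooth)
            for name, w in weights.items()}

# Hypothetical usage: score one model's captions against ground-truth references.
refs = [[["a", "dog", "runs", "on", "the", "grass"],
         ["a", "brown", "dog", "running", "outside"]]]
hyps = [["a", "dog", "running", "on", "grass"]]
print(bleu_scores(refs, hyps))
```

Running the same scoring over the captions produced by each model (NIC, Hard-Attention, NeuralTalk) on the MS-COCO test split would yield the kind of BLEU comparison described in the abstract.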
Pages: 2048-2060
Number of pages: 12
相关论文
共 64 条
  • [1] Quan Y, Li Z X, Zhang C L, Et al., Fusing deep dilated convolutions network and light-weight network for object detection, Acta Electronica Sinica, 48, 2, pp. 390-397, (2020)
  • [2] Liu Y, Liu H Y, Fan J L, Et al., A survey of research and application of small object detection based on deep learning, Acta Electronica Sinica, 48, 3, pp. 590-601, (2020)
  • [3] Image caption的发展历程和最新工作的简要综述(2010-2018)
  • [4] Vinyals O, Toshev A, Bengio S, Et al., Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 4, pp. 652-663, (2017)
  • [5] Tan X, Ren Y, He D, Et al., Multilingual neural machine translation with knowledge distillation, (2019)
  • [6] Karpathy A, Li F F., Deep visual-semantic alignments for generating image descriptions, IEEE Transactions on Pattern Analysis and Machine Intelligence, 39, 4, pp. 664-676, (2017)
  • [7] Simonyan K, Zisserman A., Very deep convolutional networks for large-scale image recognition, (2014)
  • [8] Fang H, Gupta S, Iandola F, Et al., From captions to visual concepts and back, 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1473-1482, (2015)
  • [9] Li N, Chen Z., Image cationing with visual-semantic LSTM, Proceedings of the Twenty-Seventh International Joint Conference on Artificial Intelligence, pp. 793-799, (2018)
  • [10] Anderson P, He X D, Buehler C, Et al., Bottom-up and top-down attention for image captioning and visual question answering, 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6077-6086, (2018)