Image Captioning Model Based on Multi-Level Visual Fusion

Cited by: 0

Authors:

Zhou D.-M. [1]

Zhang C.-L. [1]

Li Z.-X. [1]

Wang Z.-W. [2]

Affiliations:

[1] Guangxi Key Laboratory of Multi-source Information Mining and Security, Guangxi Normal University, Guilin

[2] School of Computer Science and Communication Engineering, Guangxi University of Science and Technology, Liuzhou

Source:

Tien Tzu Hsueh Pao/Acta Electronica Sinica | 2021, Vol. 49, No. 7

Keywords:

Attention mechanism; Image captioning; Machine learning; Reinforcement learning; Strategy network; Visual fusion

DOI:

10.12263/DZXB.20191296

Abstract:

Traditional methods attend only to entities in the visual strategy network and cannot infer the relationships between entities and attributes, while the language strategy network suffers from exposure bias and error accumulation. To address these problems, this paper proposes a multi-level visual fusion network model based on reinforcement learning. In the visual strategy network, multi-level sub-neural-network modules transform visual features into feature sets of visual knowledge. The fusion network generates function words, which make the descriptive sentences more fluent, and supports the interaction between the visual strategy network and the language strategy network. A self-critical policy gradient algorithm based on reinforcement learning optimizes the visual fusion network end to end. Experimental results on the MS COCO dataset show that the model improves the CIDEr score on the Karpathy test split from 120.1 to 124.3. © 2021, Chinese Institute of Electronics. All rights reserved.
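
The self-critical policy gradient step named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the record gives no details of their networks or reward, so the sketch assumes the standard self-critical formulation, in which the reward of a sampled caption is compared against the reward of the greedily decoded caption, and it assumes the reward is a sentence-level score such as CIDEr. The function name `self_critical_loss` and all numeric values are hypothetical.

```python
from typing import Sequence


def self_critical_loss(
    sample_logprobs: Sequence[float],
    sample_reward: float,
    greedy_reward: float,
) -> float:
    """Surrogate loss for one caption under a self-critical baseline.

    sample_logprobs: log-probability of each word in the sampled caption.
    sample_reward:   score of the sampled caption (assumed CIDEr-like).
    greedy_reward:   score of the greedily decoded caption (the baseline).
    """
    # Advantage is positive when sampling beat greedy decoding.
    advantage = sample_reward - greedy_reward
    # Minimizing this loss raises the likelihood of captions with a
    # positive advantage and lowers it for a negative advantage.
    return -advantage * sum(sample_logprobs)


# Hypothetical values, for illustration only.
logprobs = [-0.5, -1.0, -0.25]
loss = self_critical_loss(logprobs, sample_reward=1.2, greedy_reward=1.0)
no_signal = self_critical_loss(logprobs, sample_reward=1.0, greedy_reward=1.0)
```

When the sampled caption scores the same as the greedy one, the advantage is zero and the step contributes no gradient signal, which is why the greedy output serves as a baseline.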


Pages: 1286-1290

Page count: 4
