An image understanding method based on multi-level semantic features

Cited by: 0

Authors:

Mo H.-W. [1]

Tian P. [1]

Affiliation:

[1] College of Intelligent Systems Science and Engineering, Harbin Engineering University, Harbin

Source:

Kongzhi yu Juece/Control and Decision | 2021, Vol. 36, No. 12

Keywords:

Attention mechanism; Image captioning; Image understanding; Scene graph; Semantic feature; Semantic level; Visual relationship

DOI:

10.13195/j.kzyjc.2020.0927

Abstract:

Visual scene understanding includes detecting and recognizing objects, reasoning about the visual relationships among the detected objects, and describing image regions with sentences. To achieve a more comprehensive and accurate understanding of scene images, we treat object detection, visual relationship detection, and image captioning as three visual tasks at different semantic levels of scene understanding, and propose an image understanding model based on multi-level semantic features that leverages the mutual connections across the three semantic levels to solve these tasks jointly. The model iteratively and simultaneously updates the semantic features of objects, relationship phrases, and image captions through a message passing graph. The updated semantic features are used to classify objects and visual relationships and to generate scene graphs and captions, and a fusion attention mechanism is introduced to improve caption accuracy. Experimental results on the Visual Genome and COCO datasets show that the proposed method outperforms existing methods on the scene graph generation and image captioning tasks. © 2021, Editorial Office of Control and Decision. All rights reserved.
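
The abstract does not state the update rule used in the message passing graph, so the following is only a minimal sketch of the general idea, under stated assumptions: features at the three semantic levels (objects, relationship phrases, caption regions) are plain vectors, each node averages the features of its neighbours at the adjacent level, and a fixed weight `alpha` blends that message with the node's own feature. The function names, the averaging rule, and `alpha` are all hypothetical and are not taken from the paper, whose model presumably uses learned updates.

```python
from typing import List, Tuple

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def blend(own: Vector, neighbours: List[Vector], alpha: float) -> Vector:
    """Mix a node's feature with the mean of its neighbours' features.

    A node without neighbours keeps its feature unchanged.
    """
    if not neighbours:
        return list(own)
    message = mean(neighbours)
    return [(1 - alpha) * o + alpha * m for o, m in zip(own, message)]


def message_passing_step(
    objects: List[Vector],
    phrases: List[Vector],
    regions: List[Vector],
    phrase_objects: List[Tuple[int, int]],  # (subject index, object index) per phrase
    region_phrases: List[List[int]],        # phrase indices covered by each region
    alpha: float = 0.5,                     # assumed fixed blending weight
) -> Tuple[List[Vector], List[Vector], List[Vector]]:
    """One simultaneous update of all three levels, computed from the old features."""
    # Objects receive messages from the phrases they take part in.
    new_objects = [
        blend(obj, [phrases[p] for p, pair in enumerate(phrase_objects) if i in pair], alpha)
        for i, obj in enumerate(objects)
    ]
    # Phrases receive messages from their two objects and from the regions covering them.
    new_phrases = []
    for p, (phr, (s, o)) in enumerate(zip(phrases, phrase_objects)):
        covering = [regions[r] for r, members in enumerate(region_phrases) if p in members]
        new_phrases.append(blend(phr, [objects[s], objects[o]] + covering, alpha))
    # Regions receive messages from the phrases they cover.
    new_regions = [
        blend(reg, [phrases[p] for p in members], alpha)
        for reg, members in zip(regions, region_phrases)
    ]
    return new_objects, new_phrases, new_regions


# Toy graph: two objects, one phrase linking them, one region covering that phrase.
objects = [[1.0, 0.0], [0.0, 1.0]]
phrases = [[0.0, 0.0]]
regions = [[2.0, 2.0]]
for _ in range(3):  # a few refinement iterations
    objects, phrases, regions = message_passing_step(
        objects, phrases, regions, [(0, 1)], [[0]]
    )
```

Computing all three updates from the old features before replacing any of them mirrors the "simultaneous" update the abstract mentions. In a real model the refined features would then feed the object classifier, the relationship classifier, and the caption decoder.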

Citation

Pages: 2881-2890

Page count: 9
