MODELING LOCAL AND GLOBAL CONTEXTS FOR IMAGE CAPTIONING

Cited: 0
|
Authors
Yao, Peng [1 ]
Li, Jiangyun [1 ]
Guo, Longteng [2 ]
Liu, Jing [2 ]
Affiliations
[1] Univ Sci & Technol Beijing, Sch Automat & Elect Engn, Beijing, Peoples R China
[2] Chinese Acad Sci, Inst Automat, Natl Lab Pattern Recognit, Beijing, Peoples R China
Funding
Beijing Natural Science Foundation;
Keywords
Image captioning; self-attention; 1-D group convolution; image refiner;
DOI
10.1109/icme46284.2020.9102935
CLC Classification Number
TP31 [Computer Software];
Discipline Codes
081202 ; 0835 ;
Abstract
Image captioning aims to first observe an image, most notably the involved objects, which are highly context-dependent, and then depict it with a natural-language description. However, most current models use only isolated object vectors as image representations, ignoring the contexts among objects. In this paper, we introduce a Local-Global Context (LGC) network, endowing the independent object features with short-range perception (local contexts) and long-range dependence (global contexts). The LGC network can be viewed as a feature refiner that helps the caption decoder reason about novel objects and words. The local contexts are modeled with 1-D group convolution over adjacent objects, strengthening local connections. Further, a self-attention mechanism models the global contexts by correlating all the local contexts. Extensive experiments on the MSCOCO dataset demonstrate that the LGC network can be plugged into almost any neural captioning model and significantly improves performance.
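The abstract's two-stage refinement — 1-D group convolution over adjacent object features for local contexts, followed by self-attention for global contexts — can be sketched in NumPy as below. This is an illustrative sketch, not the authors' implementation: the number of groups, kernel size, and the identity query/key/value projections in the attention step are simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def local_context(x, weights, k=3):
    """1-D group convolution along the object axis (same padding).
    x: (N, d) object features; weights: (groups, k, dg, dg), dg = d // groups."""
    N, d = x.shape
    groups = weights.shape[0]
    dg = d // groups
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))          # zero-pad the object sequence
    out = np.zeros_like(x)
    for g in range(groups):
        xs = xp[:, g * dg:(g + 1) * dg]           # this group's channel slice
        for i in range(N):
            window = xs[i:i + k]                  # (k, dg) neighborhood of object i
            # contract over kernel position and input channels -> (dg,)
            out[i, g * dg:(g + 1) * dg] = np.einsum('kc,kcd->d', window, weights[g])
    return out

def self_attention(x):
    """Scaled dot-product self-attention over all objects
    (identity Q/K/V projections, kept for brevity)."""
    d = x.shape[1]
    scores = x @ x.T / np.sqrt(d)                 # pairwise similarities
    a = np.exp(scores - scores.max(axis=1, keepdims=True))
    a /= a.sum(axis=1, keepdims=True)             # softmax rows
    return a @ x                                  # context-weighted mixture

# Toy refiner pass: 5 detected objects with 8-dim features, 2 channel groups.
N, d, groups, k = 5, 8, 2, 3
x = rng.normal(size=(N, d))
w = rng.normal(size=(groups, k, d // groups, d // groups)) * 0.1
local = local_context(x, w, k)       # short-range perception
refined = self_attention(local)      # long-range dependence
print(refined.shape)                 # (5, 8): refined features for the decoder
```

In a captioning pipeline, `refined` would replace the raw detector features fed to the decoder; the group convolution keeps per-group parameter counts small while the attention step lets every object attend to all others.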
Pages: 6