Relation constraint self-attention for image captioning

Cited by: 13
Authors
Ji, Junzhong [1 ,2 ]
Wang, Mingzhan [1 ,2 ]
Zhang, Xiaodan [1 ,2 ]
Lei, Minglong [1 ,2 ]
Qu, Liangqiong [3 ]
Affiliations
[1] Beijing Univ Technol, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligen, Beijing 100124, Peoples R China
[2] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Beijing 100124, Peoples R China
[3] Stanford Univ, Dept Biomed Data Sci, Palo Alto, CA 94304 USA
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Relation constraint self-attention; Scene graph; Transformer
DOI
10.1016/j.neucom.2022.06.062
CLC number
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Self-attention based Transformer has been successfully introduced into the encoder-decoder framework of image captioning, where it excels at modeling the inner relations of inputs, i.e., image regions or semantic words. However, relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attention. Meanwhile, prior relations, e.g., the visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. The relation learning of self-attention in image captioning is therefore biased, which dilutes the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model that enhances the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information into the encoder's self-attention, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of the decoder's self-attention with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of the generated semantic words. Intuitively, RCSA-E endows the model with the ability to decide which regions to omit and which to focus on via visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of the proposed RCSA. (c) 2022 Elsevier B.V. All rights reserved.
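The two constraint mechanisms described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the use of a boolean `relation_mask` derived from the scene graph, and the placement of the mask before the softmax are all assumptions made for clarity.

```python
import numpy as np

def relation_constrained_attention(Q, K, V, relation_mask):
    """RCSA-E style: scaled dot-product attention where region pairs
    with no scene-graph relation are masked out before the softmax,
    yielding a sparse attention map over relevant regions only.

    Q, K, V:        (n_regions, d) query/key/value matrices
    relation_mask:  (n_regions, n_regions) boolean; True where a
                    visual relation exists between the two regions
                    (assumed to include self-relations on the diagonal)
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                    # dense attention logits
    scores = np.where(relation_mask, scores, -1e9)   # drop unrelated pairs
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # attend to related regions

def extend_keys_values(K, V, K_rel, V_rel):
    """RCSA-D style: append semantic-relation embeddings to the
    decoder's keys and values so attention can also land on them."""
    return np.concatenate([K, K_rel]), np.concatenate([V, V_rel])
```

With an identity `relation_mask` (each region related only to itself), each row of the attention map collapses to a one-hot weight and the output reduces to `V`; denser masks interpolate between that extreme and ordinary full self-attention.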
Pages: 778-789
Page count: 12
Related papers
50 records in total
  • [1] Improve Image Captioning by Self-attention
    Li, Zhenru
    Li, Yaoyi
    Lu, Hongtao
    NEURAL INFORMATION PROCESSING, ICONIP 2019, PT V, 2019, 1143 : 91 - 98
  • [2] Variational joint self-attention for image captioning
    Shao, Xiangjun
    Xiang, Zhenglong
    Li, Yuanxiang
    Zhang, Mingjie
    IET IMAGE PROCESSING, 2022, 16 (08) : 2075 - 2086
  • [3] A Dual Self-Attention based Network for Image Captioning
    Li, ZhiYong
    Yang, JinFu
    Li, YaPing
    PROCEEDINGS OF THE 33RD CHINESE CONTROL AND DECISION CONFERENCE (CCDC 2021), 2021, : 1590 - 1595
  • [4] Transformer with sparse self-attention mechanism for image captioning
    Wang, Duofeng
    Hu, Haifeng
    Chen, Dihu
    ELECTRONICS LETTERS, 2020, 56 (15) : 764 - +
  • [5] Dual-stream Self-attention Network for Image Captioning
    Wan, Boyang
    Jiang, Wenhui
    Fang, Yuming
    Wen, Wenying
    Liu, Hantao
    2022 IEEE INTERNATIONAL CONFERENCE ON VISUAL COMMUNICATIONS AND IMAGE PROCESSING (VCIP), 2022,
  • [6] Multi-Branch Distance-Sensitive Self-Attention Network for Image Captioning
    Ji, Jiayi
    Huang, Xiaoyang
    Sun, Xiaoshuai
    Zhou, Yiyi
    Luo, Gen
    Cao, Liujuan
    Liu, Jianzhuang
    Shao, Ling
    Ji, Rongrong
    IEEE TRANSACTIONS ON MULTIMEDIA, 2023, 25 : 3962 - 3974
  • [7] Bi-SAN-CAP: Bi-Directional Self-Attention for Image Captioning
    Hossain, Md Zakir
    Sohel, Ferdous
    Shiratuddin, Mohd Fairuz
    Laga, Hamid
    Bennamoun, Mohammed
    2019 DIGITAL IMAGE COMPUTING: TECHNIQUES AND APPLICATIONS (DICTA), 2019, : 167 - 173
  • [8] Object Relation Attention for Image Paragraph Captioning
    Yang, Li-Chuan
    Yang, Chih-Yuan
    Hsu, Jane Yung-jen
    THIRTY-FIFTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THIRTY-THIRD CONFERENCE ON INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE AND THE ELEVENTH SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2021, 35 : 3136 - 3144
  • [9] Fashion item captioning via grid-relation self-attention and gated-enhanced decoder
    Tang, Yuhao
    Zhang, Liyan
    Yuan, Ye
    MULTIMEDIA TOOLS AND APPLICATIONS, 2024, 83 (03) : 7631 - 7655