Relation constraint self-attention for image captioning

Cited by: 13
Authors
Ji, Junzhong [1 ,2 ]
Wang, Mingzhan [1 ,2 ]
Zhang, Xiaodan [1 ,2 ]
Lei, Minglong [1 ,2 ]
Qu, Liangqiong [3 ]
Affiliations
[1] Beijing Univ Technol, Fac Informat Technol, Beijing Municipal Key Lab Multimedia & Intelligen, Beijing 100124, Peoples R China
[2] Beijing Univ Technol, Beijing Inst Artificial Intelligence, Beijing 100124, Peoples R China
[3] Stanford Univ, Dept Biomed Data Sci, Palo Alto, CA 94304 USA
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Relation constraint self-attention; Scene graph; Transformer;
DOI
10.1016/j.neucom.2022.06.062
Chinese Library Classification
TP18 [Artificial Intelligence Theory];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Self-attention based Transformers have been successfully introduced into the encoder-decoder framework for image captioning, where they excel at modeling the inner relations of the inputs, i.e., image regions or semantic words. However, the relations in self-attention are usually too dense to be fully optimized, which may result in noisy relations and attentions. Meanwhile, prior relations, e.g., the visual and semantic relations between objects, which are essential for understanding and describing an image, are ignored by current self-attention. The relation learning of self-attention in image captioning is therefore biased, which dilutes the concentration of attention. In this paper, we propose a Relation Constraint Self-Attention (RCSA) model that enhances the relation learning of self-attention in image captioning by constraining self-attention with prior relations. RCSA exploits the prior visual and semantic relation information from a scene graph as constraint factors. It then builds constraints for self-attention through two sub-modules: an RCSA-E encoder module and an RCSA-D decoder module. RCSA-E introduces the visual relation information into self-attention in the encoder, which helps generate a sparse attention map by omitting the attention weights of irrelevant regions to highlight relevant visual features. RCSA-D extends the keys and values of self-attention in the decoder with the semantic relation information to constrain the learning of semantic relations and improve the accuracy of the generated semantic words. Intuitively, RCSA-E endows the model with the ability to decide which regions to omit and which to focus on using visual relation information; RCSA-D then strengthens the relation learning of the focused regions and improves sentence generation with semantic relation information. Experiments on the MSCOCO dataset demonstrate the effectiveness of our proposed RCSA. (c) 2022 Elsevier B.V. All rights reserved.
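The core idea described for RCSA-E (pruning attention weights of region pairs that are unrelated in the scene graph) can be sketched as mask-constrained scaled dot-product attention. This is a minimal illustrative sketch, not the authors' implementation: the function name, the binary `relation_mask` encoding, and the single-head NumPy formulation are all assumptions made here for clarity.

```python
import numpy as np

def relation_masked_attention(Q, K, V, relation_mask):
    """Illustrative sketch (not the paper's code): self-attention whose
    weights are constrained by a prior relation mask, in the spirit of
    RCSA-E. relation_mask[i, j] = 1 means regions i and j are related
    in the scene graph; unrelated pairs receive -inf logits, so their
    attention weights become exactly zero after softmax, yielding a
    sparse attention map over relevant regions only."""
    d = Q.shape[-1]
    logits = Q @ K.T / np.sqrt(d)                 # (n, n) scaled dot-product scores
    logits = np.where(relation_mask > 0, logits, -np.inf)  # drop unrelated pairs
    # numerically stable softmax over the remaining (related) pairs
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights
```

Under this sketch, setting `relation_mask` to all-ones recovers ordinary dense self-attention, so the prior relations act purely as a sparsifying constraint on where attention mass may go.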
Pages: 778-789
Page count: 12
Related papers
50 records in total
  • [41] Unsupervised image-to-image translation by semantics consistency and self-attention
    Zhibin Zhang
    Wanli Xue
    Guokai Fu
    Optoelectronics Letters, 2022, 18 : 175 - 180
  • [43] Research on Cascaded Labeling Framework for Relation Extraction with Self-Attention
    Xiao, Lizhong
    Zang, Zhongxing
    Song, Saisai
    Computer Engineering and Applications, 59 (03): 77 - 83
  • [44] RKT: Relation-Aware Self-Attention for Knowledge Tracing
    Pandey, Shalini
    Srivastava, Jaideep
    CIKM '20: PROCEEDINGS OF THE 29TH ACM INTERNATIONAL CONFERENCE ON INFORMATION & KNOWLEDGE MANAGEMENT, 2020, : 1205 - 1214
  • [45] SELF-ATTENTION RELATION NETWORK FOR FEW-SHOT LEARNING
    Hui, Binyuan
    Zhu, Pengfei
    Hu, Qinghua
    Wang, Qilong
    2019 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA & EXPO WORKSHOPS (ICMEW), 2019, : 198 - 203
  • [46] Sequential Recommendation with Relation-Aware Kernelized Self-Attention
    Ji, Mingi
    Joo, Weonyoung
    Song, Kyungwoo
    Kim, Yoon-Yeong
    Moon, Il-Chul
    THIRTY-FOURTH AAAI CONFERENCE ON ARTIFICIAL INTELLIGENCE, THE THIRTY-SECOND INNOVATIVE APPLICATIONS OF ARTIFICIAL INTELLIGENCE CONFERENCE AND THE TENTH AAAI SYMPOSIUM ON EDUCATIONAL ADVANCES IN ARTIFICIAL INTELLIGENCE, 2020, 34 : 4304 - 4311
  • [47] Underwater image imbalance attenuation compensation based on attention and self-attention mechanism
    Wang, Danxu
    Wei, Yanhui
    Liu, Junnan
    Ouyang, Wenjia
    Zhou, Xilin
    2022 OCEANS HAMPTON ROADS, 2022,
  • [48] RelNet-MAM: Relation Network with Multilevel Attention Mechanism for Image Captioning
    Srivastava, Swati
    Sharma, Himanshu
    MICROPROCESSORS AND MICROSYSTEMS, 2023, 102
  • [49] Decomformer: Decompose Self-Attention of Transformer for Efficient Image Restoration
    Lee, Eunho
    Hwang, Youngbae
    IEEE ACCESS, 2024, 12 : 38672 - 38684
  • [50] Self-attention random forest for breast cancer image classification
    Li, Jia
    Shi, Jingwen
    Chen, Jianrong
    Du, Ziqi
    Huang, Li
    FRONTIERS IN ONCOLOGY, 2023, 13