Direction Relation Transformer for Image Captioning

Cited by: 17
Authors
Song, Zeliang [1 ,2 ]
Zhou, Xiaofei [1 ,2 ]
Dong, Linhua [1 ,2 ]
Tan, Jianlong [1 ,2 ]
Guo, Li [1 ,2 ]
Affiliations
[1] Chinese Acad Sci, Inst Informat Engn, Beijing, Peoples R China
[2] Univ Chinese Acad Sci, Sch Cyber Secur, Beijing, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image Captioning; Direction Relation Transformer; Multi-Head Attention; Direction Embedding;
DOI
10.1145/3474085.3475607
Chinese Library Classification
TP18 [Theory of Artificial Intelligence]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Image captioning is a challenging task that combines computer vision and natural language processing to generate a textual description of the content of an image. Recently, Transformer-based encoder-decoder architectures have shown great success in image captioning, where the multi-head attention mechanism is utilized to capture the contextual interactions between object regions. However, such methods treat region features as a bag of tokens without considering the directional relationships between them, making it hard to understand the relative positions of objects in the image and to generate correct captions. In this paper, we propose a novel Direction Relation Transformer, termed DRT, which improves the orientation perception between visual features by incorporating a relative direction embedding into multi-head attention. We first generate a relative direction matrix from the positional information of the object regions, and then explore three forms of direction-aware multi-head attention to integrate the direction embedding into the Transformer architecture. We conduct experiments on the challenging Microsoft COCO image captioning benchmark. The quantitative and qualitative results demonstrate that, by integrating the relative directional relation, our approach achieves significant improvements over the baseline model on all evaluation metrics; e.g., DRT improves the task-specific CIDEr score from 129.7% to 133.2% on the offline "Karpathy" test split.
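The mechanism described in the abstract lends itself to a compact illustration. Below is a minimal, hypothetical PyTorch sketch of one way direction-aware multi-head attention could be realized: pairwise angles between region-box centers are quantized into direction bins, and a learned per-head embedding of each bin is added as a bias to the attention logits. The bin count, the additive-bias formulation, and all names (`relative_direction_matrix`, `DirectionAwareAttention`, `dir_bias`) are assumptions made for illustration, not the paper's exact design; the paper itself explores three integration forms.

```python
# Hypothetical sketch of direction-aware multi-head attention.
# Bin boundaries, embedding shape, and the additive attention-bias
# formulation are assumptions, not the authors' exact method.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_DIRECTIONS = 8  # assumed: quantize pairwise angles into 8 bins


def relative_direction_matrix(boxes: torch.Tensor) -> torch.Tensor:
    """Map each ordered pair of region boxes (x1, y1, x2, y2)
    to a direction bin. boxes: (N, 4); returns (N, N) long tensor."""
    cx = (boxes[:, 0] + boxes[:, 2]) / 2
    cy = (boxes[:, 1] + boxes[:, 3]) / 2
    dx = cx.unsqueeze(0) - cx.unsqueeze(1)  # (N, N) horizontal offsets
    dy = cy.unsqueeze(0) - cy.unsqueeze(1)  # (N, N) vertical offsets
    angle = torch.atan2(dy, dx)             # angle in (-pi, pi]
    bins = ((angle + torch.pi) / (2 * torch.pi) * NUM_DIRECTIONS).long()
    return bins.clamp_(0, NUM_DIRECTIONS - 1)


class DirectionAwareAttention(nn.Module):
    """Multi-head attention whose logits are biased by a learned
    per-head embedding of the relative direction between regions.
    Batch dimension omitted for brevity: x is (N, d_model)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.num_heads, self.d_head = num_heads, d_model // num_heads
        self.qkv = nn.Linear(d_model, 3 * d_model)
        self.out = nn.Linear(d_model, d_model)
        # one scalar bias per (direction bin, head)
        self.dir_bias = nn.Embedding(NUM_DIRECTIONS, num_heads)

    def forward(self, x: torch.Tensor, boxes: torch.Tensor) -> torch.Tensor:
        n, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q = q.view(n, self.num_heads, self.d_head).transpose(0, 1)
        k = k.view(n, self.num_heads, self.d_head).transpose(0, 1)
        v = v.view(n, self.num_heads, self.d_head).transpose(0, 1)
        logits = q @ k.transpose(-2, -1) / self.d_head ** 0.5  # (H, N, N)
        bias = self.dir_bias(relative_direction_matrix(boxes))  # (N, N, H)
        logits = logits + bias.permute(2, 0, 1)
        attn = F.softmax(logits, dim=-1)
        out = (attn @ v).transpose(0, 1).reshape(n, -1)
        return self.out(out)
```

A caller would pass the (N, d_model) region features together with their (N, 4) bounding boxes; the learned direction bias then lets each attention head favor or suppress specific spatial orientations between object pairs.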
Pages
5056-5064 (9 pages)