Position-guided transformer for image captioning

Cited by: 5
Authors
Hu, Juntao [1]
Yang, You [2]
Yao, Lu [1]
An, Yongzhi [1]
Pan, Longyue [1]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Chongqing Normal Univ, Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Image captioning; Bi-positional attention; Position encoding; Group normalization; Transformer; Self-attention;
DOI
10.1016/j.imavis.2022.104575
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer-based frameworks have shown strong performance in image captioning. However, such frameworks struggle to account for geometric interrelations among the visual contents of an image, and they fail to prevent shifts in the distribution of each layer's input in self-attention. In this work, we first propose a Bi-Positional Attention (BPA) module, which incorporates absolute and relative position encoding to precisely explore internal relations between objects and their geometric information in an image. Additionally, we use a Group Normalization (GN) method inside BPA to relieve shifts of the distribution and better exploit the channel dependence of visual features. To validate our proposals, we apply BPA and GN to the original Transformer to constitute our Position-Guided Transformer (PGT) network, which learns more comprehensive positional representations to augment spatial interactions among objects for image captioning. We conduct extensive experiments to verify the effectiveness of our model. Compared with non-pretraining state-of-the-art methods, experimental results on the MSCOCO benchmark dataset demonstrate that our PGT achieves competitive performance, reaching 134.2% CIDEr score on the Karpathy split with a single model, and 136.2% CIDEr score on the official testing server with an ensemble configuration. (c) 2022 Elsevier B.V. All rights reserved.
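The abstract gives no equations, so the following is only an illustrative sketch of the two generic ingredients it names, not the paper's BPA or GN formulation. It assumes that relative position enters attention as an additive bias on the scaled dot-product scores, and that group normalization is applied per feature vector without learned scale or shift. Absolute position encoding is omitted. The names `group_norm`, `biased_attention_weights`, and `rel_bias` are hypothetical.

```python
import math

def group_norm(x, num_groups, eps=1e-5):
    """Normalize one feature vector group by group (no learned scale/shift)."""
    if len(x) % num_groups != 0:
        raise ValueError("channel count must be divisible by num_groups")
    size = len(x) // num_groups
    out = []
    for g in range(num_groups):
        chunk = x[g * size:(g + 1) * size]
        mean = sum(chunk) / size
        var = sum((v - mean) ** 2 for v in chunk) / size
        out.extend((v - mean) / math.sqrt(var + eps) for v in chunk)
    return out

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def biased_attention_weights(query, keys, rel_bias):
    """Weights for one query: softmax(q.k / sqrt(d) + relative-position bias).

    rel_bias[j] is an assumed scalar derived from the geometry between the
    query region and region j (for example, from box offsets).
    """
    d = len(query)
    scores = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + b
        for key, b in zip(keys, rel_bias)
    ]
    return softmax(scores)

# Toy usage: 4 channels in 2 groups, and one query attending to 2 regions.
normed = group_norm([1.0, 2.0, 3.0, 4.0], num_groups=2)
weights = biased_attention_weights(
    [1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]], rel_bias=[0.0, 0.5]
)
```

In this toy setup each normalized group has zero mean, and the attention weights sum to one; the bias raises the weight of the second region without changing the content scores.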
Pages: 11
Related papers
50 in total
  • [1] Position-Guided Point Cloud Panoptic Segmentation Transformer
    Xiao, Zeqi
    Zhang, Wenwei
    Wang, Tai
    Loy, Chen Change
    Lin, Dahua
    Pang, Jiangmiao
    [J]. INTERNATIONAL JOURNAL OF COMPUTER VISION, 2024,
  • [2] A position-aware transformer for image captioning
    Deng, Zelin
    Zhou, Bo
    He, Pei
    Huang, Jianfeng
    Alfarraj, Osama
    Tolba, Amr
    [J]. Deng, Zelin (zl_deng@sina.com), 2005, Tech Science Press (70): : 2005 - 2021
  • [3] A Position-Aware Transformer for Image Captioning
    Deng, Zelin
    Zhou, Bo
    He, Pei
    Huang, Jianfeng
    Alfarraj, Osama
    Tolba, Amr
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (01): 2065 - 2081
  • [4] Dual Position Relationship Transformer for Image Captioning
    Wang, Yaohan
    Qian, Wenhua
    Nie, Rencan
    Xu, Dan
    Cao, Jinde
    Kim, Pyoungwon
    [J]. BIG DATA, 2022, 10 (06) : 515 - 527
  • [5] PCTrans: Position-Guided Transformer with Query Contrast for Biological Instance Segmentation
    Chen, Qi
    Huang, Wei
    Liu, Xiaoyu
    Li, Jiacheng
    Xiong, Zhiwei
    [J]. 2023 IEEE/CVF INTERNATIONAL CONFERENCE ON COMPUTER VISION WORKSHOPS, ICCVW, 2023: 3905 - 3914
  • [6] PCATNet: Position-Class Awareness Transformer for Image Captioning
    Tang, Ziwei
    Yi, Yaohua
    Yu, Changhui
    Yin, Aiguo
    [J]. CMC-COMPUTERS MATERIALS & CONTINUA, 2023, 75 (03): 6007 - 6022
  • [7] Image Captioning Based on An Improved Transformer with IoU Position Encoding
    Li, Yazhou
    Shi, Yihui
    Liu, Yun
    Li, Ruifan
    Ma, Zhanyu
    [J]. 2021 ASIA-PACIFIC SIGNAL AND INFORMATION PROCESSING ASSOCIATION ANNUAL SUMMIT AND CONFERENCE (APSIPA ASC), 2021: 2066 - 2071
  • [8] Semantic association enhancement transformer with relative position for image captioning
    Jia, Xin
    Wang, Yunbo
    Peng, Yuxin
    Chen, Shengyong
    [J]. MULTIMEDIA TOOLS AND APPLICATIONS, 2022, 81 (15) : 21349 - 21367
  • [9] Semantic association enhancement transformer with relative position for image captioning
    Xin Jia
    Yunbo Wang
    Yuxin Peng
    Shengyong Chen
    [J]. Multimedia Tools and Applications, 2022, 81 : 21349 - 21367
  • [10] Region-guided transformer for remote sensing image captioning
    Zhao, Kai
    Xiong, Wei
    [J]. INTERNATIONAL JOURNAL OF DIGITAL EARTH, 2024, 17 (01)