Complementary Shifted Transformer for Image Captioning

Cited by: 1
Authors
Liu, Yanbo [1 ]
Yang, You [2 ]
Xiang, Ruoyu [1 ]
Ma, Jixin [1 ]
Affiliations
[1] Chongqing Normal Univ, Sch Comp & Informat Sci, Chongqing 401331, Peoples R China
[2] Natl Ctr Appl Math Chongqing, Chongqing 401331, Peoples R China
Keywords
Image captioning; Transformer; Positional encoding; Multi-branch self-attention; Spatial shift;
DOI
10.1007/s11063-023-11314-0
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory];
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Transformer-based models have come to dominate many vision-and-language tasks, including image captioning. However, such models still suffer from limited expressive ability and from information loss during dimensionality reduction. To address these problems, this paper proposes a Complementary Shifted Transformer (CST) for image captioning. We first introduce a complementary Multi-branch Bi-positional encoding Self-Attention (MBSA) module, which combines absolute and relative positional encoding to learn precise positional representations. MBSA is also equipped with a multi-branch architecture that replicates multiple branches for each attention head; to improve the expressive ability of the model, we train these branches in a complementary way using a drop-branch technique. Furthermore, we propose a Spatial Shift Augmented module, which exploits both low-level and high-level features to enhance visual features with few additional parameters. To validate our model, we conduct extensive experiments on the MSCOCO benchmark. Compared with state-of-the-art methods, the proposed CST achieves competitive performance: 135.3% CIDEr (+0.2%) on the Karpathy split and 136.3% CIDEr (+0.9%) on the official online test server. We also evaluate the inference performance of our model on a novel object dataset. The source code and trained models are publicly available at https://github.com/noonisy/CST.
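The abstract names two mechanisms: attention branches trained in a complementary fashion via drop branch, and a parameter-free spatial shift over visual features. As a rough illustration of the first idea only (not the authors' implementation, which is published at https://github.com/noonisy/CST), the PyTorch sketch below builds several parallel attention branches and randomly zeroes a subset of them during training, so the surviving branches are pushed toward complementary representations. Every class name and hyperparameter here is an assumption.

```python
import torch
import torch.nn as nn

class MultiBranchAttention(nn.Module):
    """Illustrative multi-branch self-attention with drop-branch training.

    Each branch is an independent multi-head attention layer. During
    training, a random subset of branches is dropped and the survivors
    are averaged, which encourages complementary features. The structure
    and hyperparameters are assumptions, not the paper's exact design.
    """

    def __init__(self, dim: int, num_heads: int = 8,
                 num_branches: int = 2, drop_branch: float = 0.2):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.MultiheadAttention(dim, num_heads, batch_first=True)
            for _ in range(num_branches)
        )
        self.drop_branch = drop_branch

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Run every branch on the same input: (num_branches, B, N, D).
        outs = torch.stack(
            [attn(x, x, x, need_weights=False)[0] for attn in self.branches]
        )
        if self.training and self.drop_branch > 0:
            # Independently keep each branch with probability 1 - drop_branch.
            keep = torch.rand(outs.size(0), device=x.device) >= self.drop_branch
            if not keep.any():  # never drop every branch at once
                keep[int(torch.randint(0, outs.size(0), (1,)))] = True
            outs = outs * keep.float().view(-1, 1, 1, 1)
            return outs.sum(dim=0) / keep.sum()  # average of survivors
        return outs.mean(dim=0)  # at inference, plain average of all branches
```

The "spatial shift" keyword refers to a generic zero-parameter operation (popularized by S2-MLP) in which channel groups of a feature map are displaced by one pixel in different directions. The sketch below shows that generic operation, not the paper's exact Spatial Shift Augmented module.

```python
import torch

def spatial_shift(x: torch.Tensor) -> torch.Tensor:
    """Shift four channel groups of a (B, C, H, W) feature map by one
    pixel in four directions, with zero padding at the borders."""
    out = torch.zeros_like(x)
    g = x.size(1) // 4  # channels per group
    out[:, 0*g:1*g, :, 1:] = x[:, 0*g:1*g, :, :-1]   # group 0: shift right
    out[:, 1*g:2*g, :, :-1] = x[:, 1*g:2*g, :, 1:]   # group 1: shift left
    out[:, 2*g:3*g, 1:, :] = x[:, 2*g:3*g, :-1, :]   # group 2: shift down
    out[:, 3*g:4*g, :-1, :] = x[:, 3*g:4*g, 1:, :]   # group 3: shift up
    out[:, 4*g:] = x[:, 4*g:]  # any leftover channels pass through unchanged
    return out
```

In an encoder, such a shift could be applied to grid features before attention so that each position already mixes in its neighbors' information at no parameter cost.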
Pages: 8339-8363
Page count: 25
Related Papers
50 records in total
  • [21] Input enhanced asymmetric transformer for image captioning
    Zhu, Chenhao
    Ye, Xia
    Lu, Qiduo
    Signal, Image and Video Processing, 2023, 17(4): 1419-1427
  • [22] Attention-Aligned Transformer for Image Captioning
    Fei, Zhengcong
    Proceedings of the Thirty-Sixth AAAI Conference on Artificial Intelligence (AAAI 2022), 2022: 607-615
  • [23] Context-assisted Transformer for Image Captioning
    Lian Z.
    Wang R.
    Li H.-C.
    Yao H.
    Hu X.-H.
    Acta Automatica Sinica (Zidonghua Xuebao), 2023, 49(9): 1889-1903
  • [24] Dual Position Relationship Transformer for Image Captioning
    Wang, Yaohan
    Qian, Wenhua
    Nie, Rencan
    Xu, Dan
    Cao, Jinde
    Kim, Pyoungwon
    Big Data, 2022, 10(6): 515-527
  • [25] Position-guided transformer for image captioning
    Hu, Juntao
    Yang, You
    Yao, Lu
    An, Yongzhi
    Pan, Longyue
    Image and Vision Computing, 2022, 128
  • [26] SPT: Spatial Pyramid Transformer for Image Captioning
    Zhang, Haonan
    Zeng, Pengpeng
    Gao, Lianli
    Lyu, Xinyu
    Song, Jingkuan
    Shen, Heng Tao
    IEEE Transactions on Circuits and Systems for Video Technology, 2024, 34(6): 4829-4842
  • [28] Improved Transformer with Parallel Encoders for Image Captioning
    Lou, Liangshan
    Lu, Ke
    Xue, Jian
    2022 26th International Conference on Pattern Recognition (ICPR), 2022: 4072-4078
  • [29] Semi-Autoregressive Transformer for Image Captioning
    Zhou, Yuanen
    Zhang, Yong
    Hu, Zhenzhen
    Wang, Meng
    2021 IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), 2021: 3132-3136
  • [30] HIST: Hierarchical and sequential transformer for image captioning
    Lv, Feixiao
    Wang, Rui
    Jing, Lihua
    Dai, Pengwen
    IET Computer Vision, 2024, 18(7): 1043-1056