Input enhanced asymmetric transformer for image captioning

Cited by: 2
Authors
Zhu, Chenhao [1]
Ye, Xia [1]
Lu, Qiduo [1]
Affiliation
[1] Xian Res Inst High Tech, Xian 710025, Peoples R China
Keywords
Image caption; Adaptive sparse attention; Vision; Language pretraining
DOI
10.1007/s11760-022-02350-9
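Chinese Library Classification
TM [Electrical engineering]; TN [Electronics and communication technology]
Discipline classification codes
0808; 0809
Abstract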
Image captioning is a popular research direction in computer vision. The task enables a machine to convey its visual perception and cognition to the outside world in human language. Currently, the dominant models are Transformer-based architectures, which achieve cutting-edge performance. The meshed-memory transformer, which uses mesh-like connectivity at the decoding stage, showed us that the Transformer admits further possibilities. To explore more connectivity schemas in the Transformer, we propose the input enhanced asymmetric transformer (IEAT) model, which improves the connectivity between encoder layers and the quality of the generated captions. To evaluate our model, we conducted extensive experiments (offline evaluation, online evaluation and an ablation study) on the MS-COCO benchmark and the "Karpathy" test split. The results show that IEAT outperforms previously proposed models and generates satisfactory image captions.
Pages: 1419-1427
Page count: 9
Published in: Signal, Image and Video Processing, 2023, 17: 1419-1427