Input enhanced asymmetric transformer for image captioning

Cited by: 2
Authors
Zhu, Chenhao [1 ]
Ye, Xia [1 ]
Lu, Qiduo [1 ]
Affiliations
[1] Xi'an Research Institute of High-Tech, Xi'an 710025, People's Republic of China
Keywords
Image captioning; Adaptive sparse attention; Vision-language pretraining
DOI
10.1007/s11760-022-02350-9
CLC classification
TM [Electrical Engineering]; TN [Electronics and Communication Technology]
Discipline codes
0808; 0809
Abstract
Image captioning is a popular research direction in computer vision: it enables machines to convey the computer's visual perception and cognition to the outside world in the form of human language. Currently, the dominant models are Transformer-based architectures, which achieve cutting-edge performance. Inspired by the distinguished meshed-memory transformer, which uses mesh-like connectivity at the decoding stage and points to further possibilities within the Transformer architecture, we propose the input enhanced asymmetric transformer (IEAT) model to explore additional connectivity schemas. It improves the connectivity between encoder layers and thereby improves the quality of the generated captions. To evaluate the model, we conducted extensive experiments (offline evaluation, online evaluation and an ablation study) on the MS-COCO benchmark with the "Karpathy" test split. The results show that IEAT outperforms previously proposed models and generates satisfactory image captions.
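Illustrative note: the abstract does not spell out how the connectivity between encoder layers is enhanced, so the following is only a minimal PyTorch sketch of one plausible reading, in which the original visual region features are re-injected, through a learned gate, at the input of every encoder layer. The class name InputEnhancedEncoder, the gating scheme and all hyperparameters are assumptions made for illustration, not details taken from the IEAT paper.

# Hedged sketch: one possible interpretation of "input enhanced" encoder
# connectivity. Every encoder layer sees not only the previous layer's output
# but also a gated copy of the original region features. Names and values are
# illustrative assumptions, not the published IEAT design.
import torch
import torch.nn as nn

class InputEnhancedEncoder(nn.Module):
    """Encoder stack whose layers are each fed the raw region features again
    (an assumed reading of the enhanced inter-layer connectivity)."""

    def __init__(self, d_model: int = 512, n_heads: int = 8, n_layers: int = 3):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
            for _ in range(n_layers)
        )
        # One scalar gate per layer, controlling how much of the original
        # input is re-injected at that depth.
        self.gates = nn.ParameterList(
            nn.Parameter(torch.zeros(1)) for _ in range(n_layers)
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        # features: (batch, num_regions, d_model) visual region embeddings
        x = features
        for layer, gate in zip(self.layers, self.gates):
            # Gated re-injection of the raw features before each layer.
            x = layer(x + torch.sigmoid(gate) * features)
        return x

if __name__ == "__main__":
    enc = InputEnhancedEncoder()
    regions = torch.randn(2, 50, 512)   # e.g. 50 detected region features per image
    print(enc(regions).shape)           # torch.Size([2, 50, 512])

Under this reading, later encoder layers keep direct access to the raw detections instead of relying only on the previous layer's output, which is one way connectivity between encoder layers could be strengthened.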
Pages: 1419-1427
Number of pages: 9
Related papers
50 records in total
  • [21] Image captioning with transformer and knowledge graph
    Zhang, Yu
    Shi, Xinyu
    Mi, Siya
    Yang, Xu
    PATTERN RECOGNITION LETTERS, 2021, 143: 43-49
  • [22] Complementary Shifted Transformer for Image Captioning
    Liu, Yanbo
    Yang, You
    Xiang, Ruoyu
    Ma, Jixin
    NEURAL PROCESSING LETTERS, 2023, 55: 8339-8363
  • [23] Recurrent fusion transformer for image captioning
    Mou, Zhenping
    Yuan, Qiao
    Song, Tianqi
    SIGNAL IMAGE AND VIDEO PROCESSING, 2025, 19 (01)
  • [24] Relational-Convergent Transformer for image captioning
    Chen, Lizhi
    Yang, You
    Hu, Juntao
    Pan, Longyue
    Zhai, Hao
    DISPLAYS, 2023, 77
  • [25] Mixed Knowledge Relation Transformer for Image Captioning
    Chen, Tianyu
    Li, Zhixin
    Wei, Jiahui
    Xian, Tiantao
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022: 4403-4407
  • [26] Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning
    Li, Jingyu
    Mao, Zhendong
    Li, Hao
    Chen, Weidong
    Zhang, Yongdong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [27] Context-aware transformer for image captioning
    Yang, Xin
    Wang, Ying
    Chen, Haishun
    Li, Jie
    Huang, Tingting
    NEUROCOMPUTING, 2023, 549
  • [28] A Position-Aware Transformer for Image Captioning
    Deng, Zelin
    Zhou, Bo
    He, Pei
    Huang, Jianfeng
    Alfarraj, Osama
    Tolba, Amr
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 70 (01): 2065-2081
  • [29] Full-Memory Transformer for Image Captioning
    Lu, Tongwei
    Wang, Jiarong
    Min, Fen
    SYMMETRY-BASEL, 2023, 15 (01)