Input enhanced asymmetric transformer for image captioning

Cited by: 2
Authors
Zhu, Chenhao [1 ]
Ye, Xia [1 ]
Lu, Qiduo [1 ]
Affiliation
[1] Xian Res Inst High Tech, Xian 710025, Peoples R China
Keywords
Image caption; Adaptive sparse attention; Vision-Language pretraining
DOI
10.1007/s11760-022-02350-9
CLC (Chinese Library Classification) number
TM [Electrical Engineering]; TN [Electronic Technology, Communication Technology];
Subject classification code
0808; 0809
Abstract
Image captioning is a popular research direction in computer vision: the task of enabling a machine to convey its visual perception and cognition to the outside world in human language. Currently, the dominant models are Transformer-based architectures, which achieve cutting-edge performance. Inspired by the distinguished meshed-memory transformer model, which uses mesh-like connectivity at the decoding stage and reveals further possibilities in the Transformer architecture, we propose the input enhanced asymmetric transformer (IEAT) model to explore more possible connectivity schemas within the Transformer. It improves the connectivity between encoder layers and improves the quality of the generated captions. To evaluate our model thoroughly, we conducted extensive experiments (offline evaluation, online evaluation, and an ablation study) on the MS-COCO benchmark with the "Karpathy" test split. The results show that IEAT outperforms previously proposed models and generates satisfactory image captions.
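The "mesh-like" connectivity the abstract refers to (from the meshed-memory transformer) can be illustrated with a toy sketch: instead of the decoder cross-attending only to the last encoder layer, it aggregates cross-attention results from all encoder layers through per-layer gates. The shapes, single-head attention, and fixed gates below are illustrative assumptions, not the IEAT paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values):
    # Toy single-head scaled dot-product attention
    # (keys and values share the same matrix here).
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)
    return softmax(scores) @ keys_values

def meshed_cross_attention(queries, encoder_layer_outputs, gates):
    # Gated sum of cross-attention over EVERY encoder layer's output,
    # rather than attending to the final layer alone.
    out = np.zeros_like(queries)
    for layer_out, gate in zip(encoder_layer_outputs, gates):
        out += gate * cross_attention(queries, layer_out)
    return out

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))                         # 4 decoder positions, d=8
enc = [rng.standard_normal((10, 8)) for _ in range(3)]  # outputs of 3 encoder layers
gates = softmax(np.array([0.2, 0.5, 0.3]))              # learned per layer in practice
print(meshed_cross_attention(q, enc, gates).shape)      # (4, 8)
```

In the real model the gates are learned functions of the queries and each layer's attention output; the sketch fixes them to constants only to keep the connectivity pattern visible.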
Pages: 1419-1427
Page count: 9
Related papers
50 in total
  • [41] Efficient Image Captioning Based on Vision Transformer Models
    Elbedwehy, Samar
    Medhat, T.
    Hamza, Taher
    Alrahmawy, Mohammed F.
    CMC-COMPUTERS MATERIALS & CONTINUA, 2022, 73 (01): : 1483 - 1500
  • [42] External knowledge-assisted Transformer for image captioning
    Li, Zhixin
    Su, Qiang
    Chen, Tianyu
    IMAGE AND VISION COMPUTING, 2023, 140
  • [43] Dual-Spatial Normalized Transformer for image captioning
    Hu, Juntao
    Yang, You
    An, Yongzhi
    Yao, Lu
    ENGINEERING APPLICATIONS OF ARTIFICIAL INTELLIGENCE, 2023, 123
  • [44] Caption TLSTMs: combining transformer with LSTMs for image captioning
    Yan, Jie
    Xie, Yuxiang
    Luan, Xidao
    Guo, Yanming
    Gong, Quanzhi
    Feng, Suru
    INTERNATIONAL JOURNAL OF MULTIMEDIA INFORMATION RETRIEVAL, 2022, 11 (02) : 111 - 121
  • [45] Reinforcement Learning Transformer for Image Captioning Generation Model
    Yan, Zhaojie
    FIFTEENTH INTERNATIONAL CONFERENCE ON MACHINE VISION, ICMV 2022, 2023, 12701
  • [46] Improving Stylized Image Captioning with Better Use of Transformer
    Tan, Yutong
    Lin, Zheng
    Liu, Huan
    Zuo, Fan
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531 : 347 - 358
  • [47] Graph Alignment Transformer for More Grounded Image Captioning
    Tian, Canwei
    Hu, Haiyang
    Li, Zhongjin
    2022 INTERNATIONAL CONFERENCE ON INDUSTRIAL IOT, BIG DATA AND SUPPLY CHAIN, IIOTBDSC, 2022, : 95 - 102
  • [48] Visual enhanced gLSTM for image captioning
    Zhang, Jing
    Li, Kangkang
    Wang, Zhenkun
    Zhao, Xianwen
    Wang, Zhe
    EXPERT SYSTEMS WITH APPLICATIONS, 2021, 184
  • [49] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [50] Spiking-Transformer Optimization on FPGA for Image Classification and Captioning
    Udeji, Uchechukwu Leo
    Margala, Martin
    SOUTHEASTCON 2024, 2024, : 1353 - 1357