A Sparse Transformer-Based Approach for Image Captioning

Cited by: 5
Authors
Lei, Zhou [1 ]
Zhou, Congcong [1 ]
Chen, Shengbo [1 ]
Huang, Yiyong [1 ]
Liu, Xianrui [1 ]
Affiliations
[1] Shanghai Univ, Sch Comp Engn & Sci, Shanghai 200444, Peoples R China
Source
IEEE ACCESS | 2020 / Vol. 8
Funding
National Natural Science Foundation of China;
Keywords
Adaptation models; Decoding; Computer architecture; Sparse matrices; Visualization; Feature extraction; Task analysis; Image captioning; self-attention; explicit sparse; local adaptive threshold;
DOI
10.1109/ACCESS.2020.3024639
CLC Number
TP [Automation Technology, Computer Technology];
Discipline Code
0812;
Abstract
Image captioning is the task of generating a natural language description of an image. It has attracted significant attention from both the computer vision and natural language processing communities. Most image captioning models adopt deep encoder-decoder architectures to achieve state-of-the-art performance. However, such encoders struggle to model the relationships between pairs of input image regions, and the words in the decoder have little knowledge of their correlation with specific image regions. In this article, a novel deep encoder-decoder model built on a sparse Transformer framework is proposed for image captioning. The encoder adopts a multi-level representation of image features based on self-attention to exploit both low-level and high-level features; since self-attention can be seen as a way of encoding pairwise relationships, the correlations between image region pairs are naturally and adequately modeled. The decoder improves the concentration of multi-head self-attention on the global context by explicitly selecting the most relevant segments in each row of the attention matrix. This helps the model focus on the image regions that contribute most and generate more accurate words in context. Experiments demonstrate that our model outperforms previous methods, achieving higher performance on the MSCOCO and Flickr30k datasets. Our code is available at https://github.com/2014gaokao/ImageCaptioning.
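(A minimal PyTorch sketch of the explicit sparse attention the abstract describes: keep only the k largest scores in each row of the attention matrix and mask the rest before the softmax, so each query attends to only its most relevant key segments. The function name sparse_attention, the tensor shapes, and the fixed top_k value are illustrative assumptions, not the authors' implementation; per the keywords, the paper's decoder uses a local adaptive threshold rather than a fixed constant k.)

import torch
import torch.nn.functional as F

def sparse_attention(q, k, v, top_k=8):
    # q, k, v: (batch, heads, seq_len, d_head).
    # Scaled dot-product scores; one row per query position.
    scores = torch.matmul(q, k.transpose(-2, -1)) / q.size(-1) ** 0.5
    # Keep only the top_k largest scores in each row; everything below
    # the k-th largest is pushed to -inf so softmax assigns it zero weight.
    kth = min(top_k, scores.size(-1))
    topk_vals, _ = scores.topk(kth, dim=-1)   # values sorted descending
    threshold = topk_vals[..., -1:]           # k-th largest score per row
    scores = scores.masked_fill(scores < threshold, float("-inf"))
    return torch.matmul(F.softmax(scores, dim=-1), v)

(Setting top_k to the full key length recovers ordinary dense attention, which makes the sparsification easy to ablate.)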
Pages: 213437-213446
Number of pages: 10