A Sparse Transformer-Based Approach for Image Captioning

Cited by: 5
Authors
Lei, Zhou [1 ]
Zhou, Congcong [1 ]
Chen, Shengbo [1 ]
Huang, Yiyong [1 ]
Liu, Xianrui [1 ]
Affiliations
[1] Shanghai Univ, Sch Comp Engn & Sci, Shanghai 200444, Peoples R China
Source
IEEE ACCESS | 2020 / Vol. 8
Funding
National Natural Science Foundation of China
Keywords
Adaptation models; Decoding; Computer architecture; Sparse matrices; Visualization; Feature extraction; Task analysis; Image captioning; self-attention; explicit sparse; local adaptive threshold
DOI
10.1109/ACCESS.2020.3024639
Chinese Library Classification (CLC)
TP [Automation Technology; Computer Technology]
Discipline Code
0812
Abstract
Image captioning is the task of generating a natural language description of an image. It has attracted significant attention from both the computer vision and natural language processing communities. Most image captioning models adopt deep encoder-decoder architectures to achieve state-of-the-art performance. However, such encoders struggle to model the relationships between pairs of input image regions, and the words produced by the decoder carry little knowledge of which image regions they correspond to. In this article, a novel deep encoder-decoder model for image captioning is proposed, built on a sparse Transformer framework. The encoder adopts a multi-level, self-attention-based representation of image features to exploit both low-level and high-level features; since self-attention can be seen as a way of encoding pairwise relationships, the correlations between image region pairs are adequately modeled. The decoder sharpens the focus of multi-head self-attention on the global context by explicitly selecting only the most relevant segments in each row of the attention matrix. This helps the model attend to the image regions that contribute most and generate words that better fit the context. Experiments demonstrate that our model outperforms previous methods on the MSCOCO and Flickr30k datasets. Our code is available at https://github.com/2014gaokao/ImageCaptioning.
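The decoder's sparsification step described above, keeping only the most relevant entries in each row of the attention matrix before the softmax, can be sketched as a row-wise top-k mask. This is a minimal single-head NumPy illustration under assumed names (`sparse_attention`, `top_k`); it is not the authors' implementation, which additionally employs a local adaptive threshold and multi-head attention.

```python
import numpy as np

def sparse_attention(Q, K, V, top_k):
    """Explicit sparse attention (single-head sketch).

    For each query, only the top_k highest attention scores are kept;
    the rest are masked to -inf so they receive zero weight after softmax.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                   # (n_q, n_k) scaled dot-product scores
    # Row-wise threshold: the k-th largest score in each row.
    thresh = np.sort(scores, axis=-1)[:, -top_k][:, None]
    masked = np.where(scores >= thresh, scores, -np.inf)
    # Numerically stable softmax over the surviving entries.
    weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

# Example: 4 queries attending over 6 image-region features of width 8.
rng = np.random.default_rng(0)
Q = rng.standard_normal((4, 8))
K = rng.standard_normal((6, 8))
V = rng.standard_normal((6, 8))
out, w = sparse_attention(Q, K, V, top_k=2)
```

Each row of `w` then has exactly `top_k` nonzero weights (barring ties), so every output vector is a mixture of only the most relevant regions rather than all of them.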
Pages: 213437-213446 (10 pages)