An Attentive Fourier-Augmented Image-Captioning Transformer

Cited by: 5
Authors
Osolo, Raymond Ian [1 ,2 ]
Yang, Zhan [2 ,3 ]
Long, Jun [2 ,3 ]
Affiliations
[1] Cent South Univ, Sch Informat Sci & Engn, Changsha 410083, Peoples R China
[2] Cent South Univ, Network Resources Management & Trust Evaluat Key, Changsha 410083, Peoples R China
[3] Cent South Univ, Big Data Inst, Changsha 410083, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, Issue 18
Funding
National Natural Science Foundation of China;
Keywords
image-captioning; deep learning; transformers;
DOI
10.3390/app11188354
Chinese Library Classification
O6 [Chemistry];
Subject Classification Code
0703;
Abstract
Many vision-language models that output natural language, such as image-captioning models, typically use image features merely to ground the captions; most of their performance can be attributed to the language model, which does the heavy lifting. This phenomenon has persisted even as transformer-based architectures have become the preferred backbone of recent state-of-the-art vision-language models. In this paper, we make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. We achieve this by applying a 1D Fourier transform to the image features, first along the hidden dimension and then along the sequence dimension. Combined with the region-proposal image features, these extracted features yield a richer image representation that can be queried to produce the associated captions, which exhibit a deeper understanding of image-object-location relationships than those of similar models. Extensive experiments on the MSCOCO benchmark dataset demonstrate CIDEr-D, BLEU-1, and BLEU-4 scores of 130, 80.5, and 39, respectively.
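The abstract describes applying a 1D Fourier transform to the image features, first along the hidden dimension and then along the sequence dimension. The record includes no code; the NumPy sketch below illustrates that step under stated assumptions — the function name `fourier_mix` is illustrative, and keeping only the real part of the result follows the FNet-style convention rather than anything specified here.

```python
import numpy as np

def fourier_mix(features):
    """Illustrative sketch of the described Fourier step: a 1D FFT along
    the hidden dimension, then along the sequence dimension.

    features: array of shape (seq_len, hidden_dim), e.g. region-proposal
    image features. Retaining only the real part is an assumption
    borrowed from FNet-style token mixing, not stated in the abstract.
    """
    mixed = np.fft.fft(features, axis=-1)  # FFT over the hidden dimension
    mixed = np.fft.fft(mixed, axis=0)      # FFT over the sequence dimension
    return np.real(mixed)                  # keep the real component

# Example: 36 region features of dimension 512
feats = np.random.randn(36, 512)
out = fourier_mix(feats)
```

The output has the same shape as the input, so it can be concatenated or fused with the original region features to form the richer image representation the abstract mentions.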
Pages: 22
Related Papers
50 records in total
  • [1] Image-Captioning Model Compression
    Atliha, Viktar
    Sesok, Dmitrij
    [J]. APPLIED SCIENCES-BASEL, 2022, 12 (03):
  • [2] Retrieval-Augmented Transformer for Image Captioning
    Sarto, Sara
    Cornia, Marcella
    Baraldi, Lorenzo
    Cucchiara, Rita
    [J]. 19TH INTERNATIONAL CONFERENCE ON CONTENT-BASED MULTIMEDIA INDEXING, CBMI 2022, 2022, : 1 - 7
  • [3] Visual contextual relationship augmented transformer for image captioning
    Su, Qiang
    Hu, Junbo
    Li, Zhixin
    [J]. APPLIED INTELLIGENCE, 2024, 54 (06) : 4794 - 4813
  • [5] Attentive Linear Transformation for Image Captioning
    Ye, Senmao
    Han, Junwei
    Liu, Nian
    [J]. IEEE TRANSACTIONS ON IMAGE PROCESSING, 2018, 27 (11) : 5514 - 5524
  • [6] PAIC: Parallelised Attentive Image Captioning
    Wang, Ziwei
    Huang, Zi
    Luo, Yadan
    [J]. DATABASES THEORY AND APPLICATIONS, ADC 2020, 2020, 12008 : 16 - 28
  • [7] Attentive Contextual Network for Image Captioning
    Prudviraj, Jeripothula
    Vishnu, Chalavadi
    Mohan, C. Krishna
    [J]. 2021 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS (IJCNN), 2021
  • [8] MAT: A Multimodal Attentive Translator for Image Captioning
    Liu, Chang
    Sun, Fuchun
    Wang, Changhu
    Wang, Feng
    Yuille, Alan
    [J]. PROCEEDINGS OF THE TWENTY-SIXTH INTERNATIONAL JOINT CONFERENCE ON ARTIFICIAL INTELLIGENCE, 2017, : 4033 - 4039
  • [9] Distance Transformer for Image Captioning
    Wang, Jiarong
    Lu, Tongwei
    Liu, Xuanxuan
    Yang, Qi
    [J]. 2021 4TH INTERNATIONAL CONFERENCE ON ROBOTICS, CONTROL AND AUTOMATION ENGINEERING (RCAE 2021), 2021, : 73 - 76
  • [10] Rotary Transformer for Image Captioning
    Qiu, Yile
    Zhu, Li
    [J]. SECOND INTERNATIONAL CONFERENCE ON OPTICS AND IMAGE PROCESSING (ICOIP 2022), 2022, 12328