An Attentive Fourier-Augmented Image-Captioning Transformer

Cited by: 5
Authors
Osolo, Raymond Ian [1 ,2 ]
Yang, Zhan [2 ,3 ]
Long, Jun [2 ,3 ]
Affiliations
[1] Cent South Univ, Sch Informat Sci & Engn, Changsha 410083, Peoples R China
[2] Cent South Univ, Network Resources Management & Trust Evaluat Key, Changsha 410083, Peoples R China
[3] Cent South Univ, Big Data Inst, Changsha 410083, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, No. 18
Funding
National Natural Science Foundation of China
Keywords
image-captioning; deep learning; transformers;
DOI
10.3390/app11188354
CLC Classification
O6 [Chemistry]
Subject Classification
0703
Abstract
Many vision-language models that output natural language, such as image-captioning models, use image features merely to ground the captions; most of their performance can be attributed to the language model, which does the heavy lifting. This phenomenon has persisted even as transformer-based architectures have become the preferred base of recent state-of-the-art vision-language models. In this paper, we make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. This is achieved by performing a 1D Fourier transform on the image features, first along the hidden dimension and then along the sequence dimension. These extracted features, alongside the region-proposal image features, yield a richer image representation that can then be queried to produce the associated captions, which showcase a deeper understanding of image-object-location relationships than similar models. Extensive experiments on the MSCOCO benchmark dataset demonstrate CIDEr-D, BLEU-1, and BLEU-4 scores of 130, 80.5, and 39, respectively.
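The two-stage 1D Fourier transform described in the abstract can be sketched as follows. This is a minimal NumPy illustration of the idea, not the authors' implementation: the function name `fourier_mix` is hypothetical, and keeping only the real part of the result follows the convention popularized by FNet-style mixing layers, which the paper may or may not adopt.

```python
import numpy as np

def fourier_mix(features):
    """Mix image features with two 1D FFTs (a sketch of the idea).

    features: array of shape (seq_len, hidden_dim), e.g. region-proposal
    image features. A 1D FFT is applied first along the hidden dimension,
    then along the sequence dimension; the real part is kept so that
    downstream layers remain real-valued (an assumption, FNet-style).
    """
    f = np.fft.fft(features, axis=-1)  # FFT along the hidden dimension
    f = np.fft.fft(f, axis=0)          # FFT along the sequence dimension
    return np.real(f)
```

Applying the two 1D FFTs in succession is equivalent to a 2D FFT over the feature matrix; the Fourier-mixed output would then be combined with the original region-proposal features to form the richer representation the caption decoder attends over.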
Pages: 22
Related Papers
50 items total
  • [21] Complementary Shifted Transformer for Image Captioning. Liu, Yanbo; Yang, You; Xiang, Ruoyu; Ma, Jixin. Neural Processing Letters, 2023, 55: 8339-8363
  • [22] ETransCap: efficient transformer for image captioning. Mundu, Albert; Singh, Satish Kumar; Dubey, Shiv Ram. Applied Intelligence, 2024, 54(21): 10748-10762
  • [23] Direction Relation Transformer for Image Captioning. Song, Zeliang; Zhou, Xiaofei; Dong, Linhua; Tan, Jianlong; Guo, Li. Proceedings of the 29th ACM International Conference on Multimedia (MM 2021), 2021: 5056-5064
  • [24] ReFormer: The Relational Transformer for Image Captioning. Yang, Xuewen; Liu, Yingru; Wang, Xin. Proceedings of the 30th ACM International Conference on Multimedia (MM 2022), 2022: 5398-5406
  • [25] Text with Knowledge Graph Augmented Transformer for Video Captioning. Gu, Xin; Chen, Guang; Wang, Yufei; Zhang, Libo; Luo, Tiejian; Wen, Longyin. 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023: 18941-18951
  • [26] Fourier Image Transformer. Buchholz, Tim-Oliver; Jug, Florian. 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2022: 1845-1853
  • [27] Image and Video Captioning with Augmented Neural Architectures. Shetty, Rakshith; Tavakoli, Hamed R.; Laaksonen, Jorma. IEEE MultiMedia, 2018, 25(2): 34-46
  • [28] Relational-Convergent Transformer for image captioning. Chen, Lizhi; Yang, You; Hu, Juntao; Pan, Longyue; Zhai, Hao. Displays, 2023, 77
  • [29] Mixed Knowledge Relation Transformer for Image Captioning. Chen, Tianyu; Li, Zhixin; Wei, Jiahui; Xian, Tiantao. 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022: 4403-4407
  • [30] A Position-Aware Transformer for Image Captioning. Deng, Zelin; Zhou, Bo; He, Pei; Huang, Jianfeng; Alfarraj, Osama; Tolba, Amr. CMC-Computers Materials & Continua, 2022, 70(1): 2065-2081