An Attentive Fourier-Augmented Image-Captioning Transformer

Cited: 5
Authors
Osolo, Raymond Ian [1 ,2 ]
Yang, Zhan [2 ,3 ]
Long, Jun [2 ,3 ]
Affiliations
[1] Central South University, School of Information Science and Engineering, Changsha 410083, Peoples R China
[2] Central South University, Key Laboratory of Network Resources Management and Trust Evaluation, Changsha 410083, Peoples R China
[3] Central South University, Big Data Institute, Changsha 410083, Peoples R China
Source
APPLIED SCIENCES-BASEL | 2021, Vol. 11, Issue 18
Funding
National Natural Science Foundation of China
Keywords
image-captioning; deep learning; transformers;
DOI
10.3390/app11188354
Chinese Library Classification
O6 [Chemistry]
Discipline Code
0703
Abstract
Many vision-language models that generate natural language, such as image-captioning models, use image features merely to ground the captions; most of their performance can be attributed to the language model, which does the heavy lifting. This phenomenon has persisted even as transformer-based architectures have become the preferred backbone of recent state-of-the-art vision-language models. In this paper, we make the images matter more by using fast Fourier transforms to further break down the input features and extract more of their intrinsic salient information, resulting in more detailed yet concise captions. We achieve this by applying a 1D Fourier transform to the image features, first along the hidden dimension and then along the sequence dimension. Combined with the region-proposal image features, the extracted features yield a richer image representation that can then be queried to produce the associated captions, which showcase a deeper understanding of image-object-location relationships than those of similar models. Extensive experiments on the MSCOCO benchmark dataset demonstrate CIDEr-D, BLEU-1, and BLEU-4 scores of 130, 80.5, and 39, respectively.
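The two-stage Fourier mixing described above can be sketched as follows. This is a minimal illustration, assuming the FNet-style convention of taking the real part of the transform so that downstream layers remain real-valued; the paper's exact implementation (normalization, whether the real part is taken after each stage or only at the end) may differ.

```python
import numpy as np

def fourier_mix(features):
    """Mix region-proposal image features with two 1D FFTs.

    features: array of shape (seq_len, hidden_dim), one row per
    image region. Returns a real-valued array of the same shape.
    """
    # 1D Fourier transform along the hidden dimension first ...
    mixed = np.fft.fft(features, axis=-1)
    # ... then along the sequence dimension.
    mixed = np.fft.fft(mixed, axis=0)
    # Keep the real part so the result can feed real-valued layers.
    return np.real(mixed)

# Example: 4 region features with hidden size 8.
regions = np.random.randn(4, 8)
enriched = fourier_mix(regions)  # same shape, globally mixed
```

Because the FFT is parameter-free and mixes every position with every other one, this stage adds global context to the region features at negligible cost before they are combined with the original features and queried by the caption decoder.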
Pages: 22