GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features

Cited: 60
Authors
Nguyen, Van-Quang [1]
Suganuma, Masanori [1,2]
Okatani, Takayuki [1,2]
Affiliations
[1] Tohoku Univ, Grad Sch Informat Sci, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Tokyo, Japan
Source
Computer Vision - ECCV 2022 (Lecture Notes in Computer Science)
Keywords
Image captioning; Grid features; Region features
DOI
10.1007/978-3-031-20059-5_10
CLC Classification Number
TP18 [Artificial Intelligence Theory]
Discipline Classification Codes
081104; 0812; 0835; 1405
Abstract
Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential for describing the content of images; they are usually extracted by an object detector such as Faster R-CNN. However, they have several issues, such as a lack of contextual information, the risk of inaccurate detection, and high computational cost. The first two could be resolved by additionally using grid-based features. However, how to extract and fuse these two types of features remains largely unexplored. This paper proposes a Transformer-only neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes the two visual features to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster. Moreover, its monolithic design, consisting only of Transformers, enables end-to-end training of the model. This innovative design and the integration of the dual visual features bring about a significant performance improvement. Experimental results on several image captioning benchmarks show that GRIT outperforms previous methods in both inference accuracy and speed.
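
To make the dual-feature idea concrete, the following is a minimal sketch (in PyTorch, not the authors' released code) of a caption-decoder layer that consumes both feature types: masked self-attention over the partial caption, followed by separate cross-attention over grid features and region features whose outputs are fused by summation. The class name DualVisualDecoderLayer, the sum-based fusion, and all dimensions are illustrative assumptions; GRIT's actual caption generator may combine the two attention outputs differently.

# Illustrative sketch of a decoder layer attending to dual visual features (assumed design, PyTorch).
import torch
import torch.nn as nn

class DualVisualDecoderLayer(nn.Module):
    """One Transformer decoder layer with separate cross-attention over
    grid features and region features; the two outputs are summed."""
    def __init__(self, d_model=512, n_heads=8, d_ff=2048, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.cross_attn_grid = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.cross_attn_region = nn.MultiheadAttention(d_model, n_heads, dropout, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)

    def forward(self, words, grid_feats, region_feats, causal_mask=None):
        # Masked self-attention over the partial caption.
        x = words
        attn_out, _ = self.self_attn(x, x, x, attn_mask=causal_mask)
        x = self.norm1(x + self.dropout(attn_out))
        # Cross-attention over the two visual feature sets, fused by summation.
        g, _ = self.cross_attn_grid(x, grid_feats, grid_feats)
        r, _ = self.cross_attn_region(x, region_feats, region_feats)
        x = self.norm2(x + self.dropout(g + r))
        # Position-wise feed-forward network.
        x = self.norm3(x + self.dropout(self.ffn(x)))
        return x

if __name__ == "__main__":
    layer = DualVisualDecoderLayer()
    words = torch.randn(2, 12, 512)         # partial caption embeddings
    grid_feats = torch.randn(2, 49, 512)    # e.g., 7x7 grid features from the backbone
    region_feats = torch.randn(2, 20, 512)  # object queries from a DETR-style detector
    out = layer(words, grid_feats, region_feats)
    print(out.shape)  # torch.Size([2, 12, 512])
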
Pages: 167-184
Page count: 18