GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features

Cited by: 60
Authors
Nguyen, Van-Quang [1]
Suganuma, Masanori [1,2]
Okatani, Takayuki [1,2]
Affiliations
[1] Tohoku Univ, Grad Sch Informat Sci, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Tokyo, Japan
Keywords
Image captioning; Grid features; Region features
DOI
10.1007/978-3-031-20059-5_10
CLC classification
TP18 [Artificial Intelligence Theory]
Discipline codes
081104; 0812; 0835; 1405
Abstract
Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential for describing the content of images; such features are usually extracted by an object detector such as Faster R-CNN. However, they have several issues, including a lack of contextual information, the risk of inaccurate detection, and high computational cost. The first two could be resolved by additionally using grid-based features, but how to extract and fuse these two types of features remains uncharted. This paper proposes a Transformer-only neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes the two visual features to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster. Moreover, its monolithic design, consisting only of Transformers, enables end-to-end training of the model. This design and the integration of the dual visual features bring about a significant performance improvement. Experimental results on several image captioning benchmarks show that GRIT outperforms previous methods in both inference accuracy and speed.
Pages: 167-184 (18 pages)
Related Papers (50 in total)
  • [1] Geometrically-Aware Dual Transformer Encoding Visual and Textual Features for Image Captioning
    Chang, Yu-Ling
    Ma, Hao-Shang
    Li, Shiou-Chi
    Huang, Jen-Wei
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT V, PAKDD 2024, 2024, 14649: 15-27
  • [2] Dual-visual collaborative enhanced transformer for image captioning
    Mou, Zhenping
    Song, Tianqi
    Luo, Hong
    MULTIMEDIA SYSTEMS, 2025, 31(02)
  • [3] Image Captioning using Visual Attention and Detection Transformer Model
    Eluri, Yaswanth
    Vinutha, N.
    Jeevika, M.
    Sree, Sai Bhavya N.
    Abhiram, G. Surya
    10TH INTERNATIONAL CONFERENCE ON ELECTRONICS, COMPUTING AND COMMUNICATION TECHNOLOGIES, CONECCT 2024, 2024
  • [4] Dual-adaptive interactive transformer with textual and visual context for image captioning
    Chen, Lizhi
    Li, Kesen
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 243
  • [5] Dual Global Enhanced Transformer for image captioning
    Xian, Tiantao
    Li, Zhixin
    Zhang, Canlong
    Ma, Huifang
    NEURAL NETWORKS, 2022, 148: 129-141
  • [6] Dual Position Relationship Transformer for Image Captioning
    Wang, Yaohan
    Qian, Wenhua
    Nie, Rencan
    Xu, Dan
    Cao, Jinde
    Kim, Pyoungwon
    BIG DATA, 2022, 10(06): 515-527
  • [7] Improving Stylized Image Captioning with Better Use of Transformer
    Tan, Yutong
    Lin, Zheng
    Liu, Huan
    Zuo, Fan
    ARTIFICIAL NEURAL NETWORKS AND MACHINE LEARNING - ICANN 2022, PT III, 2022, 13531: 347-358
  • [8] Exploring better image captioning with grid features
    Yan, Jie
    Xie, Yuxiang
    Guo, Yanming
    Wei, Yingmei
    Luan, Xidao
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10(03): 3541-3556
  • [9] Exploring refined dual visual features cross-combination for image captioning
    Hu, Junbo
    Li, Zhixin
    Su, Qiang
    Tang, Zhenjun
    Ma, Huifang
    NEURAL NETWORKS, 2024, 180