GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features

Cited by: 60
Authors
Nguyen, Van-Quang [1]
Suganuma, Masanori [1,2]
Okatani, Takayuki [1,2]
Affiliations
[1] Tohoku Univ, Grad Sch Informat Sci, Sendai, Miyagi, Japan
[2] RIKEN, Ctr AIP, Tokyo, Japan
Source
Computer Vision - ECCV 2022, Lecture Notes in Computer Science, Springer
Keywords
Image captioning; Grid features; Region features
DOI
10.1007/978-3-031-20059-5_10
Chinese Library Classification (CLC)
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Current state-of-the-art methods for image captioning employ region-based features, as they provide object-level information that is essential for describing the content of images; these features are usually extracted by an object detector such as Faster R-CNN. However, they have several issues, such as a lack of contextual information, the risk of inaccurate detection, and high computational cost. The first two could be resolved by additionally using grid-based features, but how to extract and fuse these two types of features remains an open question. This paper proposes a Transformer-only neural architecture, dubbed GRIT (Grid- and Region-based Image captioning Transformer), that effectively utilizes the two visual features to generate better captions. GRIT replaces the CNN-based detector employed in previous methods with a DETR-based one, making it computationally faster. Moreover, its monolithic design, consisting only of Transformers, enables end-to-end training of the model. This design and the integration of the dual visual features bring about significant performance improvements. Experimental results on several image captioning benchmarks show that GRIT outperforms previous methods in both inference accuracy and speed.
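As a rough, non-authoritative illustration of the idea described in the abstract, the Python (PyTorch) sketch below shows one plausible way a caption-decoder layer could attend to grid and region features in parallel and fuse the two attended outputs. It is not the authors' implementation; the module name DualCrossAttentionLayer, the dimensions, and the concatenate-then-project fusion are all assumptions made for the sake of the example.

# A minimal sketch, NOT the authors' code: a decoder layer that cross-attends
# to grid features and region features in parallel, then fuses the results.
# All names and hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn

class DualCrossAttentionLayer(nn.Module):  # hypothetical name
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Masked self-attention over the partially generated caption.
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # One cross-attention module per visual stream.
        self.attn_grid = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.attn_region = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Concatenate-then-project fusion of the two attended outputs (an assumption).
        self.fuse = nn.Linear(2 * d_model, d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.ReLU(), nn.Linear(4 * d_model, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, words, grid_feats, region_feats, causal_mask=None):
        # words: (B, L, d); grid_feats: (B, G, d); region_feats: (B, R, d),
        # assuming both visual streams were already projected to d_model.
        x, _ = self.self_attn(words, words, words, attn_mask=causal_mask)
        words = self.norm1(words + x)
        g, _ = self.attn_grid(words, grid_feats, grid_feats)        # context from grid
        r, _ = self.attn_region(words, region_feats, region_feats)  # context from regions
        words = self.norm2(words + self.fuse(torch.cat([g, r], dim=-1)))
        return self.norm3(words + self.ffn(words))

# Toy usage with random tensors:
layer = DualCrossAttentionLayer()
words = torch.randn(2, 20, 512)    # partial captions: (batch, length, d_model)
grid = torch.randn(2, 49, 512)     # e.g. a 7x7 grid of features
regions = torch.randn(2, 10, 512)  # e.g. 10 detected object regions
out = layer(words, grid, regions)  # -> (2, 20, 512)

In this sketch the two visual streams are fused by concatenating the attended outputs and projecting back to the model width; the paper itself should be consulted for how GRIT actually combines the two streams in its caption generator.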
Pages: 167-184
Number of pages: 18