Transformer-based local-global guidance for image captioning

Cited by: 11
Authors
Parvin, Hashem [1 ]
Naghsh-Nilchi, Ahmad Reza [1 ]
Mohammadi, Hossein Mahvash [1 ]
Affiliations
[1] Univ Isfahan, Fac Comp Engn, Dept Artificial Intelligence, Esfahan, Iran
Keywords
Attention; Transformer; Encoder-decoder; Image captioning; Deep learning; ATTENTION; NETWORK; GENERATION
DOI
10.1016/j.eswa.2023.119774
Chinese Library Classification
TP18 [Artificial Intelligence Theory]
Discipline Codes
081104; 0812; 0835; 1405
Abstract
Image captioning, which requires machine learning algorithms to compress the rich content of an image into descriptive language, is a difficult problem. Recurrent models are widely used as decoders and achieve strong performance, but they are complicated and inherently sequential over time. Transformers, by contrast, model long-range dependencies and support parallel processing of sequences. However, recent transformer-based models assign attention weights to all candidate vectors under the assumption that every vector is relevant, and they ignore intra-object relationships. Moreover, a single attention mechanism cannot capture the complex relationships between key and query vectors. In this paper, a new transformer-based image captioning structure without recurrence or convolution is proposed to address these issues. To this end, a generator network and a selector network are designed to generate textual descriptions collaboratively. The work comprises three main steps: (1) design a transformer-based generator network that serves as word-level guidance, generating the next word from the current state; (2) train a latent space that maps captions and images into the same embedding space, learning the text-image relation; and (3) design a selector network that serves as sentence-level guidance, evaluating candidate next words by assigning fitness scores to partial captions through the embedding space. Compared with existing architectures, the proposed approach contains an attention mechanism without temporal dependencies: at each state it selects the next best word using local-global guidance. In addition, the proposed model maintains dependencies between sequence elements and can be trained in parallel. Several experiments on the COCO and Flickr datasets demonstrate that the proposed approach outperforms various state-of-the-art models on well-known evaluation measures.
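To make the decoding procedure concrete, below is a minimal PyTorch sketch of how such local-global guidance could be combined at each decoding step, assuming a generator that returns vocabulary logits for a partial caption and a selector that maps token ids into the shared text-image embedding space. The function name decode_step, the top-k candidate set, and the alpha-weighted fusion rule are illustrative assumptions, not the paper's published formulation.

    import torch
    import torch.nn.functional as F

    def decode_step(generator, selector, image_emb, partial_ids, alpha=0.7, k=5):
        """Pick the next word by fusing word-level (local) and
        sentence-level (global) guidance. All interfaces are assumed."""
        # Word-level guidance: the generator scores the vocabulary given
        # the partial caption; logits shape assumed [1, seq_len, vocab].
        logits = generator(partial_ids)
        log_probs = F.log_softmax(logits[:, -1], dim=-1)

        # Keep the generator's top-k candidate next words.
        top_lp, top_ids = log_probs.topk(k, dim=-1)

        # Sentence-level guidance: embed each extended partial caption and
        # score its fitness as cosine similarity to the image embedding
        # in the shared text-image space (the learned latent space).
        fitness = []
        for wid in top_ids[0]:
            extended = torch.cat([partial_ids, wid.view(1, 1)], dim=1)
            cap_emb = selector(extended)  # assumed: ids -> joint embedding
            fitness.append(F.cosine_similarity(cap_emb, image_emb, dim=-1))
        fitness = torch.stack(fitness).view(1, k)

        # Fuse local and global scores; alpha weights word-level guidance.
        score = alpha * top_lp + (1.0 - alpha) * fitness
        best = score.argmax(dim=-1, keepdim=True)
        return top_ids.gather(1, best)  # id of the chosen next word

Under these assumptions the generator supplies the local, word-level signal and the selector the global, sentence-level signal; the greedy argmax at the end could equally be replaced by beam search over the fused scores.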
Pages: 20
Related Papers (50 in total)
  • [31] An Analysis of the Use of Feed-Forward Sub-Modules for Transformer-Based Image Captioning Tasks. Osolo, Raymond Ian; Yang, Zhan; Long, Jun. Applied Sciences-Basel, 2021, 11(24).
  • [32] Modeling Local and Global Contexts for Image Captioning. Yao, Peng; Li, Jiangyun; Guo, Longteng; Liu, Jing. 2020 IEEE International Conference on Multimedia and Expo (ICME), 2020.
  • [33] Transformer-Based SAR Image Despeckling. Perera, Malsha V.; Bandara, Wele Gedara Chaminda; Valanarasu, Jeya Maria Jose; Patel, Vishal M. 2022 IEEE International Geoscience and Remote Sensing Symposium (IGARSS 2022), 2022: 751-754.
  • [34] DualFormer: Local-Global Stratified Transformer for Efficient Video Recognition. Liang, Yuxuan; Zhou, Pan; Zimmermann, Roger; Yan, Shuicheng. Computer Vision, ECCV 2022, Pt XXXIV, 2022, 13694: 577-595.
  • [35] Local-Global Transformer Neural Network for temporal action segmentation. Tian, Xiaoyan; Jin, Ye; Tang, Xianglong. Multimedia Systems, 2023, 29(2): 615-626.
  • [36] Landslide Susceptibility Mapping Considering Landslide Local-Global Features Based on CNN and Transformer. Zhao, Zeyang; Chen, Tao; Dou, Jie; Liu, Gang; Plaza, Antonio. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2024, 17: 7475-7489.
  • [37] A Local-Global Interactive Vision Transformer for Aerial Scene Classification. Peng, Ting; Yi, Jingjun; Fang, Yuan. IEEE Geoscience and Remote Sensing Letters, 2023, 20.
  • [38] Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning. Lee, Hojun; Cho, Hyunjun; Park, Jieun; Chae, Jinyeong; Kim, Jihie. Sensors, 2022, 22(4).
  • [39] GLCM: Global-Local Captioning Model for Remote Sensing Image Captioning. Wang, Qi; Huang, Wei; Zhang, Xueting; Li, Xuelong. IEEE Transactions on Cybernetics, 2023, 53(11): 6910-6922.
  • [40] Weakly Supervised Local-Global Anchor Guidance Network for Landslide Extraction With Image-Level Annotations. Zhang, Xiaokang; Yu, Weikang; Ma, Xianping; Kang, Xudong. IEEE Geoscience and Remote Sensing Letters, 2023, 20.