LG-MLFormer: local and global MLP for image captioning

Cited by: 2
Authors
Jiang, Zetao [1 ]
Wang, Xiuxian [1 ]
Zhai, Zhongyi [1 ]
Cheng, Bo [2 ]
Affiliations
[1] Guilin Univ Elect Technol, Sch Comp Sci & Informat Secur, Guilin 541004, Peoples R China
[2] Beijing Univ Posts & Telecommun, State Key Lab Networking & Switching Technol, Beijing 100876, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Reinforcement learning; Artificial intelligence; Transformer; Multi-layer perceptrons;
DOI
10.1007/s13735-023-00266-9
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory];
Discipline classification codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Self-attention-based image captioning models suffer from loss of the spatial information in visual features. Introducing relative position encoding can alleviate the problem to some extent, but it brings additional parameters and greater computational complexity. To address this, we propose a novel local-global MLFormer (LG-MLFormer) with a specifically designed encoder module, the local-global multi-layer perceptron (LG-MLP). The LG-MLP can capture the latent correlations between different images, and its linear stacking calculation mode reduces computational complexity. It consists of two independent local MLP (LM) modules and a cross-domain global MLP (CDGM) module. The LM is designed with specific mapping dimensions between its linear layers so that the spatial information of visual features is self-compensated without introducing relative position encoding. The CDGM module aggregates cross-domain latent correlations between grid-based features and region-based features, combining the complementary advantages of these global and local semantic associations. Experiments on the Karpathy test split and the online test server show that our approach provides superior or comparable performance to the state of the art (SOTA).
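The abstract does not specify the LM's layer shapes or activation, so the following is only a rough illustration of the general idea it relies on: a linear layer applied across the token (position) axis has position-specific weights, so it is sensitive to spatial order without any explicit position encoding. This is a generic MLP-Mixer-style token-mixing block, not the authors' LM or CDGM; the function name, the ReLU activation, the residual connection, and all sizes are assumptions made for the sketch.

```python
import numpy as np

def token_mixing_mlp(x, w1, w2):
    """Generic token-mixing MLP (hypothetical helper, not the paper's LM).

    x  : (num_tokens, channels) visual features, one row per grid cell or region
    w1 : (hidden_tokens, num_tokens) first linear map over the token axis
    w2 : (num_tokens, hidden_tokens) second linear map back to num_tokens
    """
    h = np.maximum(w1 @ x, 0.0)  # ReLU is an assumption; the abstract names no activation
    return x + w2 @ h            # residual connection is also an assumption

rng = np.random.default_rng(0)
num_tokens, channels, hidden_tokens = 49, 512, 98  # illustrative sizes only
x = rng.standard_normal((num_tokens, channels))
w1 = rng.standard_normal((hidden_tokens, num_tokens)) * 0.02
w2 = rng.standard_normal((num_tokens, hidden_tokens)) * 0.02

y = token_mixing_mlp(x, w1, w2)

# Reordering the tokens does not simply reorder the output,
# because each weight is tied to a specific position.
perm = np.arange(num_tokens)[::-1]
order_sensitive = not np.allclose(token_mixing_mlp(x[perm], w1, w2), y[perm])
```

In this sketch the mixing cost is linear in the channel count and fixed by the chosen `hidden_tokens`, whereas self-attention computes a data-dependent `num_tokens × num_tokens` map; how the paper's LG-MLP sets its mapping dimensions is not stated in the abstract.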
Pages: 13
Related papers
50 in total
  • [1] LG-MLFormer: local and global MLP for image captioning
    Jiang, Zetao
    Wang, Xiuxian
    Zhai, Zhongyi
    Cheng, Bo
    [J]. International Journal of Multimedia Information Retrieval, 2023, 12
  • [2] MODELING LOCAL AND GLOBAL CONTEXTS FOR IMAGE CAPTIONING
    Yao, Peng
    Li, Jiangyun
    Guo, Longteng
    Liu, Jing
    [J]. 2020 IEEE INTERNATIONAL CONFERENCE ON MULTIMEDIA AND EXPO (ICME), 2020
  • [3] GLCM: Global-Local Captioning Model for Remote Sensing Image Captioning
    Wang, Qi
    Huang, Wei
    Zhang, Xueting
    Li, Xuelong
    [J]. IEEE TRANSACTIONS ON CYBERNETICS, 2023, 53 (11) : 6910 - 6922
  • [4] Local-to-Global Semantic Supervised Learning for Image Captioning
    Wang, Juan
    Duan, Yiping
    Tao, Xiaoming
    Lu, Jianhua
    [J]. ICC 2020 - 2020 IEEE INTERNATIONAL CONFERENCE ON COMMUNICATIONS (ICC), 2020
  • [5] CSTNET: ENHANCING GLOBAL-TO-LOCAL INTERACTIONS FOR IMAGE CAPTIONING
    Yang, Xin
    Wang, Ying
    Chen, Haishun
    Li, Jie
    [J]. 2022 IEEE INTERNATIONAL CONFERENCE ON IMAGE PROCESSING, ICIP, 2022 : 1861 - 1865
  • [6] Local-global visual interaction attention for image captioning
    Wang, Changzhi
    Gu, Xiaodong
    [J]. DIGITAL SIGNAL PROCESSING, 2022, 130
  • [7] Transformer-based local-global guidance for image captioning
    Parvin, Hashem
    Naghsh-Nilchi, Ahmad Reza
    Mohammadi, Hossein Mahvash
    [J]. EXPERT SYSTEMS WITH APPLICATIONS, 2023, 223
  • [8] Image captioning based on global-local feature and adaptive-attention
    Zhao X.-H.
    Yin L.-F.
    Zhao C.-L.
    [J]. Zhejiang Daxue Xuebao (Gongxue Ban)/Journal of Zhejiang University (Engineering Science), 2020, 54 (01) : 126 - 134
  • [9] Fine-Grained Image Captioning With Global-Local Discriminative Objective
    Wu, Jie
    Chen, Tianshui
    Wu, Hefeng
    Yang, Zhi
    Luo, Guangchun
    Lin, Liang
    [J]. IEEE TRANSACTIONS ON MULTIMEDIA, 2021, 23 : 2413 - 2427
  • [10] Attention-guided image captioning with adaptive global and local feature fusion
    Zhong, Xian
    Nie, Guozhang
    Huang, Wenxin
    Liu, Wenxuan
    Ma, Bo
    Lin, Chia-Wen
    [J]. JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2021, 78