Exploring refined dual visual features cross-combination for image captioning

Cited by: 0
Authors
Hu, Junbo [1, 2]
Li, Zhixin [1, 2]
Su, Qiang [1, 2]
Tang, Zhenjun [1, 2]
Ma, Huifang [3]
Affiliations
[1] Guangxi Normal Univ, Key Lab Educ Blockchain & Intelligent Technol, Minist Educ, Guilin 541004, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[3] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Cross Combination; Contrastive Language-Image Pre-Training; Reinforcement learning; VIDEO;
DOI
10.1016/j.neunet.2024.106710
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Transformer-based encoders have become commonplace in current image captioning for encoding region features and grid features: their multi-head self-attention mechanism lets the encoder better capture contextual information and the relationships between different regions of the image. However, stacking Transformer blocks requires self-attention computation that is quadratic in the number of visual features, which not only computes numerous redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that affect attention focus, aiming to obtain more refined visual features and improve encoding efficiency. Next, we develop a parallel cross-fusion attention module (PCFA) that fully exploits the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments conducted on the MSCOCO dataset demonstrate that our proposed DCCT method achieves outstanding performance, rivaling current state-of-the-art approaches.
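The two ideas named in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract gives no formulas, so the query-selection score (maximum minus mean of each query's attention scores, in the style of ProbSparse attention), the mean-value fallback for unselected queries, the residual fusion, and all function names and tensor sizes below are assumptions chosen for illustration. Learned projections and multiple heads are omitted.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def prob_sparse_self_attention(x, u):
    """Single-head self-attention in which only the u most 'active' queries attend.

    Assumption: activity is scored as max minus mean of each query's scaled
    scores. For clarity the full score matrix is computed here, so this sketch
    shows the selection logic only and does not realise any computational saving.
    """
    n, d = x.shape
    scores = x @ x.T / np.sqrt(d)                      # (n, n) scaled dot products
    sparsity = scores.max(axis=1) - scores.mean(axis=1)
    top = np.argsort(-sparsity)[:u]                    # indices of the selected queries
    out = np.tile(x.mean(axis=0), (n, 1))              # unselected queries get the mean value
    out[top] = softmax(scores[top]) @ x                # selected queries attend normally
    return out, top

def cross_attention(queries, context):
    # Plain scaled dot-product attention from one feature set to the other.
    d = queries.shape[1]
    weights = softmax(queries @ context.T / np.sqrt(d))
    return weights @ context

def parallel_cross_fusion(grid, region):
    # Two cross-attention branches computed independently from the same inputs,
    # each added back to its own stream (hypothetical residual fusion).
    grid_out = grid + cross_attention(grid, region)
    region_out = region + cross_attention(region, grid)
    return grid_out, region_out

rng = np.random.default_rng(0)
grid = rng.normal(size=(49, 64))     # e.g. a 7x7 grid of 64-d features (assumed sizes)
region = rng.normal(size=(10, 64))   # e.g. 10 detected regions (assumed sizes)

refined_grid, kept = prob_sparse_self_attention(grid, u=12)
fused_grid, fused_region = parallel_cross_fusion(refined_grid, region)
print(fused_grid.shape, fused_region.shape, len(kept))
```

Each stream keeps its own length after fusion, so a downstream decoder could attend to either or both; how DCCT actually combines them is not stated in the abstract.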
Pages: 13
Related papers
  • [1] A Novelty Framework in Image-Captioning with Visual Attention-Based Refined Visual Features
    Thobhani, Alaa
    Zou, Beiji
    Kui, Xiaoyan
    Abdussalam, Amr
    Asim, Muhammad
    Elaffendi, Mohammed
    Shah, Sajid
    CMC-COMPUTERS MATERIALS & CONTINUA, 2025, 82 (03) : 3943 - 3964
  • [2] Exploring Visual Relationship for Image Captioning
    Yao, Ting
    Pan, Yingwei
    Li, Yehao
    Mei, Tao
    COMPUTER VISION - ECCV 2018, PT XIV, 2018, 11218 : 711 - 727
  • [3] RVAIC: Refined visual attention for improved image captioning
    Al-Qatf, Majjed
    Hawbani, Ammar
    Wang, XingFu
    Abdusallam, Amr
    Alsamhi, Saeed
    Alhabib, Mohammed
    Curry, Edward
    Journal of Intelligent and Fuzzy Systems, 2024, 46 (02) : 3447 - 3459
  • [5] GRIT: Faster and Better Image Captioning Transformer Using Dual Visual Features
    Van-Quang Nguyen
    Suganuma, Masanori
    Okatani, Takayuki
    COMPUTER VISION, ECCV 2022, PT XXXVI, 2022, 13696 : 167 - 184
  • [7] Exploring better image captioning with grid features
    Yan, Jie
    Xie, Yuxiang
    Guo, Yanming
    Wei, Yingmei
    Luan, Xidao
    COMPLEX & INTELLIGENT SYSTEMS, 2024, 10 (03) : 3541 - 3556
  • [8] Geometrically-Aware Dual Transformer Encoding Visual and Textual Features for Image Captioning
    Chang, Yu-Ling
    Ma, Hao-Shang
    Li, Shiou-Chi
    Huang, Jen-Wei
    ADVANCES IN KNOWLEDGE DISCOVERY AND DATA MINING, PT V, PAKDD 2024, 2024, 14649 : 15 - 27
  • [9] Exploring region features in remote sensing image captioning
    Zhao, Kai
    Xiong, Wei
    INTERNATIONAL JOURNAL OF APPLIED EARTH OBSERVATION AND GEOINFORMATION, 2024, 127
  • [10] EXPLORING DUAL STREAM GLOBAL INFORMATION FOR IMAGE CAPTIONING
    Xian, Tiantao
    Li, Zhixin
    Chen, Tianyu
    Ma, Huifang
    2022 IEEE INTERNATIONAL CONFERENCE ON ACOUSTICS, SPEECH AND SIGNAL PROCESSING (ICASSP), 2022 : 4458 - 4462