Exploring refined dual visual features cross-combination for image captioning

Cited: 0
Authors
Hu, Junbo [1 ,2 ]
Li, Zhixin [1 ,2 ]
Su, Qiang [1 ,2 ]
Tang, Zhenjun [1 ,2 ]
Ma, Huifang [3 ]
Affiliations
[1] Guangxi Normal Univ, Key Lab Educ Blockchain & Intelligent Technol, Minist Educ, Guilin 541004, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[3] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
Funding
National Natural Science Foundation of China;
关键词
Image captioning; Cross Combination; Contrastive Language-Image Pre-Training; Reinforcement learning; VIDEO;
D O I
10.1016/j.neunet.2024.106710
CLC number
TP18 [Theory of Artificial Intelligence];
Discipline codes
081104 ; 0812 ; 0835 ; 1405 ;
Abstract
Transformer-based encoders have become commonplace for encoding region features and grid features in current image captioning tasks: thanks to the multi-head self-attention mechanism, the encoder can better capture the relationships between different image regions as well as contextual information. However, stacking Transformer blocks requires quadratic self-attention computation over the visual features, which not only produces many redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that disturb attention focus, yielding more refined visual features and improving encoding efficiency. Next, we develop a parallel cross-fusion attention module (PCFA) that fully exploits the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT achieves outstanding performance, rivaling current state-of-the-art approaches.
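The parallel cross-fusion idea in the abstract can be illustrated with a minimal sketch: each visual stream (grid features, region features) attends over the other, and the attended outputs are combined with the originals. This is a simplified single-head NumPy illustration under assumed feature shapes (7x7 grid, 36 region boxes), not the authors' exact PCFA module.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, keys_values, d):
    # queries from one stream attend over the other stream
    scores = queries @ keys_values.T / np.sqrt(d)   # (nq, nkv)
    return softmax(scores, axis=-1) @ keys_values   # (nq, d)

d = 64
rng = np.random.default_rng(0)
grid = rng.standard_normal((49, d))     # assumed 7x7 grid features
region = rng.standard_normal((36, d))   # assumed 36 detected region features

# parallel cross-fusion: both directions computed in parallel,
# residual-added to the originals, then concatenated
grid_attends_region = cross_attention(grid, region, d)    # (49, d)
region_attends_grid = cross_attention(region, grid, d)    # (36, d)
fused = np.concatenate(
    [grid + grid_attends_region, region + region_attends_grid], axis=0
)  # (85, d) fused dual visual features
```

The real PCFA presumably uses learned query/key/value projections and multiple heads; the sketch only shows the cross-combination pattern between the two feature types.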
Pages: 13
Related papers
50 records
  • [21] Matching Visual Features to Hierarchical Semantic Topics for Image Paragraph Captioning
    Guo, Dandan
    Lu, Ruiying
    Chen, Bo
    Zeng, Zequn
    Zhou, Mingyuan
    INTERNATIONAL JOURNAL OF COMPUTER VISION, 2022, 130 (08) : 1920 - 1937
  • [22] CropCap: Embedding Visual Cross-Partition Dependency for Image Captioning
    Wang, Bo
    Zhang, Zhao
    Zhao, Suiyi
    Zhang, Haijun
    Hong, Richang
    Wang, Meng
    PROCEEDINGS OF THE 31ST ACM INTERNATIONAL CONFERENCE ON MULTIMEDIA, MM 2023, 2023, : 1750 - 1758
  • [23] Comparison and combination of textual and visual features for interactive cross-language image retrieval
    Cheng, PC
    Yeh, JY
    Ke, HR
    Chien, BC
    Yang, WP
    MULTILINGUAL INFORMATION ACCESS FOR TEXT, SPEECH AND IMAGES, 2005, 3491 : 793 - 804
  • [24] Dual-adaptive interactive transformer with textual and visual context for image captioning
    Chen, Lizhi
    Li, Kesen
    EXPERT SYSTEMS WITH APPLICATIONS, 2024, 243
  • [25] Image Classification Based on the Combination of Text Features and Visual Features
    Tian, Lexiao
    Zheng, Dequan
    Zhu, Conghui
    INTERNATIONAL JOURNAL OF INTELLIGENT SYSTEMS, 2013, 28 (03) : 242 - 256
  • [26] A Novel Cross-Fusion Method of Different Types of Features for Image Captioning
    Lou, Liangshan
    Lu, Ke
    Xue, Jian
    2023 INTERNATIONAL JOINT CONFERENCE ON NEURAL NETWORKS, IJCNN, 2023,
  • [27] Exploring Visual Relationships via Transformer-based Graphs for Enhanced Image Captioning
    Li, Jingyu
    Mao, Zhendong
    Li, Hao
    Chen, Weidong
    Zhang, Yongdong
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (05)
  • [28] Exploring Implicit and Explicit Relations with the Dual Relation-Aware Network for Image Captioning
    Zha, Zhiwei
    Zhou, Pengfei
    Bai, Cong
    MULTIMEDIA MODELING, MMM 2022, PT II, 2022, 13142 : 97 - 108
  • [29] GRPIC: an end-to-end image captioning model using three visual features
    Peng, Shixin
    Xiong, Can
    Liu, Leyuan
    Yang, Laurence T.
    Chen, Jingying
    INTERNATIONAL JOURNAL OF MACHINE LEARNING AND CYBERNETICS, 2025, 16 (03) : 1559 - 1572
  • [30] Visual-linguistic-stylistic Triple Reward for Cross-lingual Image Captioning
    Zhang, Jing
    Guo, Dan
    Yang, Xun
    Song, Peipei
    Wang, Meng
    ACM TRANSACTIONS ON MULTIMEDIA COMPUTING COMMUNICATIONS AND APPLICATIONS, 2024, 20 (04)