Exploring refined dual visual features cross-combination for image captioning

Cited by: 0
Authors
Hu, Junbo [1 ,2 ]
Li, Zhixin [1 ,2 ]
Su, Qiang [1 ,2 ]
Tang, Zhenjun [1 ,2 ]
Ma, Huifang [3 ]
Affiliations
[1] Guangxi Normal Univ, Key Lab Educ Blockchain & Intelligent Technol, Minist Educ, Guilin 541004, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[3] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
Funding
National Natural Science Foundation of China;
Keywords
Image captioning; Cross-combination; Contrastive Language-Image Pre-Training; Reinforcement learning; VIDEO;
DOI
10.1016/j.neunet.2024.106710
CLC number
TP18 [Theory of artificial intelligence];
Discipline codes
081104; 0812; 0835; 1405;
Abstract
Transformer-based encoders have become commonplace for encoding region and grid features in current image captioning tasks: their multi-head self-attention mechanism lets the encoder better capture the relationships between different image regions and the surrounding contextual information. However, stacking Transformer blocks requires quadratic self-attention over the visual features, which not only computes many redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that blur attention focus, yielding more refined visual features and improving encoding efficiency. Next, we develop a parallel cross-fusion attention module (PCFA) that fully exploits the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT achieves outstanding performance, rivaling current state-of-the-art approaches.
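The abstract describes a parallel cross-fusion attention module (PCFA) that fuses encoded grid and region features, but the record gives no implementation details. The following is only a minimal PyTorch-style sketch of bidirectional cross-attention between the two feature sets; the class name CrossFusionSketch, the feature dimensions, the mean-pooling alignment, and the concat-and-project fusion are illustrative assumptions, not the authors' PCFA.

import torch
import torch.nn as nn

class CrossFusionSketch(nn.Module):
    def __init__(self, dim=512, heads=8):
        super().__init__()
        # Each branch lets one feature type query the other (assumed design).
        self.grid_to_region = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.region_to_grid = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)  # simple concat-and-project fusion (assumption)

    def forward(self, grid_feats, region_feats):
        # grid_feats: (B, N_grid, dim); region_feats: (B, N_region, dim)
        g, _ = self.grid_to_region(grid_feats, region_feats, region_feats)  # grid queries attend to regions
        r, _ = self.region_to_grid(region_feats, grid_feats, grid_feats)    # region queries attend to grid cells
        # Mean-pool the region branch and broadcast to the grid length before fusing
        # (one of several possible alignment choices).
        r_pooled = r.mean(dim=1, keepdim=True).expand(-1, g.size(1), -1)
        return self.fuse(torch.cat([g, r_pooled], dim=-1))  # (B, N_grid, dim)

# Example usage with 49 grid tokens and 36 region tokens:
# fused = CrossFusionSketch()(torch.randn(2, 49, 512), torch.randn(2, 36, 512))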
Pages: 13
Related papers
50 records in total
  • [31] Robust Hand Tracking with Refined CAMShift Based on Combination of Depth and Image Features
    Cui, Wenhuan
    Wang, Wenmin
    Liu, Hong
    2012 IEEE INTERNATIONAL CONFERENCE ON ROBOTICS AND BIOMIMETICS (ROBIO 2012), 2012,
  • [32] Manifold-Based Combination of Visual Features and Keyword Features for Image Retrieval
    Li, Jing
    Liu, Fuqiang
    Li, Zhipeng
    Cui, Jianzhu
    PROCEEDINGS OF THE 2009 WRI GLOBAL CONGRESS ON INTELLIGENT SYSTEMS, VOL III, 2009: 554 - 558
  • [33] When Visual Object-Context Features Meet Generic and Specific Semantic Priors in Image Captioning
    Liu, Heng
    Tian, Chunna
    Jiang, Mengmeng
    TENTH INTERNATIONAL CONFERENCE ON GRAPHICS AND IMAGE PROCESSING (ICGIP 2018), 2019, 11069
  • [34] End-to-end Image Captioning via Visual Region Aggregation and Dual-level Collaboration
    Song J.-K.
    Zeng P.-P.
    Gu J.-Y.
    Zhu J.-K.
    Gao L.-L.
    Ruan Jian Xue Bao/Journal of Software, 2023, 34 (05): 2152 - 2169
  • [35] Tagging Image by Exploring Weighted Correlation between Visual Features and Tags
    Zhang, Xiaoming
    Li, Zhoujun
    Long, Yun
    WEB-AGE INFORMATION MANAGEMENT, 2011, 6897 : 277 - +
  • [36] A novel method of image retrieval based on combination of semantic and visual features
    Li, M
    Wang, T
    Zhang, BW
    Ye, BC
    FUZZY SYSTEMS AND KNOWLEDGE DISCOVERY, PT 1, PROCEEDINGS, 2005, 3613 : 619 - 628
  • [37] Affective Image Captioning for Visual Artworks Using Emotion-Based Cross-Attention Mechanisms
    Ishikawa, Shintaro
    Sugiura, Komei
    IEEE ACCESS, 2023, 11 : 24527 - 24534
  • [38] Cross Encoder-Decoder Transformer with Global-Local Visual Extractor for Medical Image Captioning
    Lee, Hojun
    Cho, Hyunjun
    Park, Jieun
    Chae, Jinyeong
    Kim, Jihie
    SENSORS, 2022, 22 (04)
  • [39] TrTr-CMR: Cross-Modal Reasoning Dual Transformer for Remote Sensing Image Captioning
    Wu, Yinan
    Li, Lingling
    Jiao, Licheng
    Liu, Fang
    Liu, Xu
    Yang, Shuyuan
    IEEE TRANSACTIONS ON GEOSCIENCE AND REMOTE SENSING, 2024, 62
  • [40] Image fusion based on visual salient features and the cross-contrast
    Adu, Jianhua
    Xie, Shenghua
    Gan, Jianhong
    JOURNAL OF VISUAL COMMUNICATION AND IMAGE REPRESENTATION, 2016, 40 : 218 - 224