Exploring refined dual visual features cross-combination for image captioning

Cited by: 0
Authors
Hu, Junbo [1, 2]
Li, Zhixin [1, 2]
Su, Qiang [1, 2]
Tang, Zhenjun [1, 2]
Ma, Huifang [3]
Affiliations
[1] Guangxi Normal Univ, Key Lab Educ Blockchain & Intelligent Technol, Minist Educ, Guilin 541004, Peoples R China
[2] Guangxi Normal Univ, Guangxi Key Lab Multisource Informat Min & Secur, Guilin 541004, Peoples R China
[3] Northwest Normal Univ, Coll Comp Sci & Engn, Lanzhou 730070, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Image captioning; Cross Combination; Contrastive Language-Image Pre-Training; Reinforcement learning; VIDEO;
DOI
10.1016/j.neunet.2024.106710
Chinese Library Classification (CLC) number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Transformer-based encoders have become commonplace in image captioning for encoding region features and grid features: thanks to their multi-head self-attention mechanism, they can better capture contextual information and the relationships between different regions of an image. However, stacking Transformer blocks requires self-attention over the visual features, whose cost is quadratic in the number of features; this not only computes numerous redundant features but also significantly increases computational overhead. This paper presents a novel Distilled Cross-Combination Transformer (DCCT) network. Technically, we first introduce a distillation cascade fusion encoder (DCFE), in which a probabilistic sparse self-attention layer filters out redundant and distracting features that affect attention focus, with the aim of obtaining more refined visual features and improving encoding efficiency. Next, we develop a parallel cross-fusion attention module (PCFA) that fully exploits the complementarity and correlation between grid and region features to better fuse the encoded dual visual features. Extensive experiments on the MSCOCO dataset demonstrate that the proposed DCCT method achieves outstanding performance, rivaling current state-of-the-art approaches.
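The two ideas named in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation: the abstract does not give the DCFE or PCFA formulations, so the sparse layer below is assumed to follow the ProbSparse scheme known from the Informer model (queries ranked by a max-minus-mean score, inactive queries replaced by the mean of the values, without Informer's key sampling), and the cross-fusion step is assumed to be two cross-attention passes with a residual add. All function names (`attention`, `sparsity_scores`, `sparse_self_attention`, `cross_fuse`) are hypothetical, and learned projections and multiple heads are omitted.

```python
import math

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention with no learned projections."""
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, values)

def sparsity_scores(queries, keys):
    """Max-minus-mean of each query's scaled scores (assumed Informer-style measure).
    A query whose attention would be nearly uniform scores close to zero."""
    scale = math.sqrt(len(queries[0]))
    scores = matmul(queries, transpose(keys))
    return [(max(row) - sum(row) / len(row)) / scale for row in scores]

def sparse_self_attention(features, top_u):
    """Only the top_u highest-scoring queries attend; the rest receive the mean feature."""
    scores = sparsity_scores(features, features)
    ranked = sorted(range(len(features)), key=lambda i: scores[i], reverse=True)
    active = sorted(ranked[:top_u])
    mean_value = [sum(col) / len(features) for col in zip(*features)]
    output = [list(mean_value) for _ in features]
    if active:
        attended = attention([features[i] for i in active], features, features)
        for i, row in zip(active, attended):
            output[i] = row
    return output

def cross_fuse(grid, region):
    """Hypothetical parallel cross-attention: each feature set queries the other,
    and the result is added back to the querying features (residual assumption)."""
    grid_ctx = attention(grid, region, region)
    region_ctx = attention(region, grid, grid)
    fused_grid = [[g + c for g, c in zip(g_row, c_row)] for g_row, c_row in zip(grid, grid_ctx)]
    fused_region = [[r + c for r, c in zip(r_row, c_row)] for r_row, c_row in zip(region, region_ctx)]
    return fused_grid, fused_region

# Toy features: 3 grid cells and 2 detected regions, each of dimension 2.
grid = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
region = [[0.5, 0.5], [2.0, 0.0]]
refined_grid = sparse_self_attention(grid, top_u=2)
fused_grid, fused_region = cross_fuse(refined_grid, region)
```

With `top_u` equal to the number of features, the sparse layer reduces to ordinary self-attention; smaller values skip the softmax for the remaining queries, which is where the efficiency gain would come from under this assumption.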
Pages: 13