Multi-level information fusion Transformer with background filter for fine-grained image recognition

Cited by: 0
Authors
Yu, Ying [1, 2]
Wang, Jinghui [2]
Pedrycz, Witold [3]
Miao, Duoqian [4]
Qian, Jin [2]
Affiliations
[1] East China Jiaotong Univ, State Key Lab Performance Monitoring & Protecting, Nanchang 330013, Jiangxi, Peoples R China
[2] East China Jiaotong Univ, Sch Software, Nanchang 330013, Jiangxi, Peoples R China
[3] Univ Alberta, Dept Elect & Comp Engn, Edmonton, AB T6G 2G7, Canada
[4] Tongji Univ, Sch Elect & Informat Engn, Shanghai, Peoples R China
Funding
National Natural Science Foundation of China
Keywords
Fine-grained image recognition; Vision Transformer; Multi-level information; Information fusion
DOI
10.1007/s10489-024-05584-x
Chinese Library Classification number
TP18 [Artificial intelligence theory]
Discipline classification codes
081104; 0812; 0835; 1405
Abstract
Compared to traditional image recognition, Fine-Grained Image Recognition (FGIR) faces significant challenges due to the subtle distinctions among different categories and the notable variances within the same category. Furthermore, the complexity of backgrounds and the extraction of discriminative features limited to small local regions further exacerbate the difficulty. Recently, several studies have demonstrated the effectiveness of the Vision Transformer (ViT) in FGIR. However, these investigations have frequently overlooked critical information embedded within class tokens across different layers, while also neglecting the subtle local details hidden within patch tokens. To address these issues and enhance FGIR performance, we introduce a novel ViT-based network architecture MIFBF. The proposed model builds upon ViT by incorporating three modules: Complementary Class Tokens Combination module (CCTC), Patches Information Integration module (PII), and Attention Cropping Module (ACM). The CCTC module integrates multi-layer class tokens to capture complementary information, thereby enhancing the model's representational capacity. The PII module delves into the rich local details encoded in patch tokens to improve classification accuracy. The ACM module generates regions of interest based on ViT's self-attention weights and effectively filters background noise, thereby directing the model's attention to the most relevant image areas. Experiments conducted on three different datasets validate the effectiveness of the proposed model, yielding state-of-the-art results and highlighting its superiority in FGIR tasks.
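The abstract says the ACM module derives a region of interest from ViT's self-attention weights and filters out background, but it does not give the procedure. The following is only a minimal sketch of one plausible approach, not the authors' implementation: it assumes the class token's attention to each patch is already available as a flat vector, and it assumes a mean-value threshold. The function name `attention_crop_box` and its parameters are hypothetical.

```python
import numpy as np


def attention_crop_box(cls_attention, grid_size, patch_size):
    """Return a pixel bounding box (left, top, right, bottom) around high-attention patches.

    cls_attention: flat sequence of length grid_size * grid_size holding the
    class token's attention to each patch (assumed input, e.g. averaged over heads).
    The mean threshold below is an assumption, not taken from the paper.
    """
    attn = np.asarray(cls_attention, dtype=float).reshape(grid_size, grid_size)
    mask = attn > attn.mean()  # patches above average attention count as foreground

    rows = np.where(mask.any(axis=1))[0]
    cols = np.where(mask.any(axis=0))[0]
    if rows.size == 0:
        # Uniform attention gives no foreground patches: keep the whole image.
        full = grid_size * patch_size
        return (0, 0, full, full)

    top, bottom = int(rows[0]), int(rows[-1]) + 1
    left, right = int(cols[0]), int(cols[-1]) + 1
    return (left * patch_size, top * patch_size,
            right * patch_size, bottom * patch_size)


# Toy example: a 4x4 patch grid where only the central 2x2 patches attract attention.
toy = np.zeros((4, 4))
toy[1:3, 1:3] = 1.0
print(attention_crop_box(toy.ravel(), grid_size=4, patch_size=16))  # (16, 16, 48, 48)
```

In a pipeline of this kind, the returned box would typically be used to crop the input image, which is then resized and passed through the network again so that background patches no longer compete for attention.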
Pages: 8108-8119
Page count: 12